Poster Session
Poster Session 5 Pavilion 3
SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis
Shahriar Noroozizadeh ⋅ Xiaobin Shen ⋅ Jeremy Weiss ⋅ George H. Chen
Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from causal survival forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE‐Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE‐Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods.
A Relative Error-Based Evaluation Framework of Heterogeneous Treatment Effect Estimators
Jiayi Guo ⋅ Haoxuan Li ⋅ YE TIAN ⋅ Peng Wu
While significant progress has been made in heterogeneous treatment effect (HTE) estimation, the evaluation of HTE estimators remains underdeveloped. In this article, we propose a robust evaluation framework based on relative error, which quantifies performance differences between two HTE estimators. We first derive the key theoretical conditions on the nuisance parameters that are necessary to achieve a robust estimator of relative error. Building on these conditions, we introduce novel loss functions and design a neural network architecture to estimate nuisance parameters, thereby obtaining a robust estimate of relative error. We provide large sample properties of the proposed relative error estimator. Furthermore, beyond evaluation, we propose a new learning algorithm for HTE that leverages both previously proposed HTE estimators and the nuisance parameters learned through our neural network architecture. Extensive experiments demonstrate that our evaluation framework supports reliable comparisons across HTE estimators, and that the proposed learning algorithm for HTE exhibits desirable performance.
Learning Exposure Mapping Functions for Inferring Heterogeneous Peer Effects
Shishir Adhikari ⋅ Sourav Medya ⋅ Elena Zheleva
Peer effect refers to the difference in counterfactual outcomes for a unit resulting from different levels of peer exposure, the extent to which the unit is exposed to the treatments, actions, or behaviors of its peers. In practice, peer exposure is typically captured through an explicitly defined exposure mapping function that aggregates peer treatments and outputs peer exposure. Exposure mapping functions range from simple functions like the number or fraction of treated friends to more sophisticated functions that allow for different peers to exert different degrees of influence. However, the true function is rarely known in practice and when the function is misspecified, this leads to biased causal effect estimation. To address this problem, the focus of our work is to move away from the need to explicitly define an exposure mapping function and instead introduce a framework that allows learning this function automatically. We develop EGONETGNN, a graph neural network (GNN), for heterogeneous peer effect estimation that automatically learns the appropriate exposure mapping function and allows for complex peer exposure mechanisms that involve not only peer treatments but also attributes of the local neighborhood, including node, edge, and structural attributes. We theoretically and empirically show that GNN models that use peer exposure based on the number or fraction of treated peers or learn peer exposure naively face difficulty accounting for such influence mechanisms. Our evaluation on synthetic and semi-synthetic network data shows that our method is more robust to different unknown underlying influence mechanisms when compared to state-of-the-art baselines.
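To make the notion of an explicitly defined exposure mapping concrete, the following minimal sketch implements the two hand-specified variants the abstract mentions: the fraction of treated peers, and a weighted version in which peers exert different degrees of influence. The function names, the dictionary-based graph format, and the convention of assigning zero exposure to isolated units are illustrative assumptions, not part of the paper; the learned mapping in EGONETGNN is not shown.

```python
def fraction_treated_exposure(adjacency, treatment):
    """Exposure of each unit = fraction of its peers that are treated.

    adjacency: dict mapping unit -> list of peer units (hypothetical format).
    treatment: list of 0/1 treatment indicators indexed by unit.
    """
    exposure = {}
    for unit, peers in adjacency.items():
        if not peers:
            exposure[unit] = 0.0  # assumed convention for units without peers
            continue
        exposure[unit] = sum(treatment[p] for p in peers) / len(peers)
    return exposure


def weighted_exposure(adjacency, treatment, weights):
    """Exposure where each edge (unit, peer) carries its own influence weight.

    weights: dict mapping (unit, peer) -> non-negative weight; missing edges
    default to 1.0 (an assumption made for this sketch).
    """
    exposure = {}
    for unit, peers in adjacency.items():
        total = sum(weights.get((unit, p), 1.0) for p in peers)
        if total == 0:
            exposure[unit] = 0.0
            continue
        treated = sum(weights.get((unit, p), 1.0) * treatment[p] for p in peers)
        exposure[unit] = treated / total
    return exposure


adjacency = {0: [1, 2], 1: [0], 2: [0, 1], 3: []}
treatment = [1, 0, 1, 0]
print(fraction_treated_exposure(adjacency, treatment))
print(weighted_exposure(adjacency, treatment, {(0, 1): 1.0, (0, 2): 3.0}))
```

Both functions fix the aggregation rule in advance, which is exactly the misspecification risk the abstract describes when the true influence mechanism is unknown.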
CaTs and DAGs: Integrating Directed Acyclic Graphs with Transformers for Causally Constrained Predictions
Matthew Vowels ⋅ Mathieu Rochat ⋅ Sina Akbari
Artificial Neural Networks (ANNs), including fully-connected networks and transformers, are highly flexible and powerful function approximators, widely applied in fields like computer vision and natural language processing. However, their inability to inherently respect causal structures can limit their robustness, making them vulnerable to covariate shift and difficult to interpret/explain. This poses significant challenges for their reliability in real-world applications. In this paper, we introduce Causal Transformers (CaTs), a general model class designed to operate under predefined causal constraints, as specified by a Directed Acyclic Graph (DAG). CaTs retain the powerful function approximation abilities of traditional neural networks while adhering to the underlying structural constraints, improving robustness, reliability, and interpretability at inference time. This approach opens new avenues for deploying neural networks in more demanding, real-world scenarios where robustness and explainability are critical.
Causal Discovery in the Wild: A Voting-Theoretic Ensemble Approach
Vy Vo ⋅ Haoxuan Li ⋅ Mingming Gong
Causal discovery is a critical yet persistently challenging task across scientific domains. Despite years of significant algorithmic advances, existing methods still struggle with inconsistent outcomes due to reliance on untestable assumptions, sensitivity to data perturbations, and optimization constraints. To this end, ensemble-based causal discovery has been actively pursued, aiming to aggregate multiple structural predictions for increased stability and uncertainty estimation. However, current aggregation methods are largely heuristic, lacking theoretical guarantees and guidance on how ensemble design choices affect performance. This work addresses these fundamental limitations. We introduce a principled voting-based framework for structural ensembling, establishing conditions under which the aggregated structure recovers the true causal graph. Our analysis yields a theoretically justified weighted voting mechanism that informs optimal choices regarding the number, competency, and diversity of causal discovery experts in the ensemble. Extensive experiments on synthetic and real-world datasets verify the robustness and effectiveness of our approach, offering a rigorous alternative to existing heuristic ensemble methods.
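As a rough illustration of weighted voting over structural predictions, the sketch below aggregates edge votes from several experts using the classical log-odds weighting from voting theory, where an expert with per-edge accuracy p receives weight log(p / (1 - p)). This is an assumed stand-in for the general idea, not the paper's mechanism: the function name, the per-expert competence values, and the independent per-edge decision rule are all hypothetical, and the result is not guaranteed to be acyclic.

```python
import math


def weighted_edge_vote(graphs, competences):
    """Aggregate binary adjacency matrices by weighted majority vote per edge.

    graphs: list of n x n 0/1 adjacency matrices, one per expert.
    competences: assumed per-expert edge accuracies, each strictly in (0, 1).
    """
    weights = [math.log(p / (1.0 - p)) for p in competences]
    n = len(graphs[0])
    aggregated = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Each expert votes +1 (edge present) or -1 (edge absent).
            score = sum(w * (1 if g[i][j] else -1) for w, g in zip(weights, graphs))
            aggregated[i][j] = 1 if score > 0 else 0
    return aggregated


strong = [[0, 1], [0, 0]]   # one highly competent expert predicts the edge 0 -> 1
weak_a = [[0, 0], [0, 0]]   # two weaker experts predict no edges
weak_b = [[0, 0], [0, 0]]
print(weighted_edge_vote([strong, weak_a, weak_b], [0.9, 0.6, 0.6]))
print(weighted_edge_vote([strong, weak_a, weak_b], [0.7, 0.7, 0.7]))
```

With unequal competences the single strong expert outweighs the two weak ones, whereas equal competences reduce the rule to a plain majority vote.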
Independence Test for Linear Non-Gaussian Data and Applications in Causal Discovery
Yiqing Li ⋅ Xiaofei Wang ⋅ Boyang Sun ⋅ Yewei Xia ⋅ Kun Zhang
Independence testing involves determining whether two variables are independent based on observed samples, which is a fundamental problem in statistics and machine learning. Existing testing methods, such as HSIC, can theoretically detect broad forms of dependence, but may sacrifice statistical power when applied to limited samples for which background knowledge of the distribution is available. In this paper, we focus on linear non-Gaussian data, a widely supported model in scientific data analysis and causal discovery, where variables are linked linearly with non-Gaussian noise terms. We provide a new theoretical characterization of independence in this case, showing that constancy of the conditional mean and variance is sufficient to guarantee independence under linear non-Gaussian models. Building on this result, we develop a kernel-based testing framework with provable asymptotic guarantees. Extensive experiments on synthetic and real-world datasets demonstrate that our method achieves higher power than existing approaches and significantly improves downstream causal discovery performance.
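For readers unfamiliar with the HSIC baseline mentioned above, here is a minimal sketch of the biased empirical HSIC statistic, trace(KHLH) / n^2, for one-dimensional samples with Gaussian kernels. It illustrates the existing general-purpose test statistic only, not the proposed test; the fixed bandwidth of 1.0 and the helper names are assumptions made for the example, and no null distribution or p-value is computed.

```python
import math


def gaussian_gram(xs, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2)) for b in xs] for a in xs]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - 11^T / n."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, bandwidth=1.0):
    """Biased empirical HSIC: trace(K H L H) / n^2."""
    n = len(xs)
    k_centered = center(gaussian_gram(xs, bandwidth))
    l_gram = gaussian_gram(ys, bandwidth)
    # trace(HKH @ L) equals the elementwise sum because L is symmetric.
    return sum(k_centered[i][j] * l_gram[i][j] for i in range(n) for j in range(n)) / n ** 2


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(hsic(xs, [x * x for x in xs]))  # nonlinear dependence gives a positive value
print(hsic(xs, [1.0] * len(xs)))      # a constant variable gives (numerically) zero
```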
Score-based Greedy Search for Structure Identification of Partially Observed Causal Models
Xinshuai Dong ⋅ Ignavier Ng ⋅ Haoyue Dai ⋅ Jiaqi Sun ⋅ Xiangchen Song ⋅ Peter Spirtes ⋅ Kun Zhang
Identifying the structure of a partially observed causal system is essential to various scientific fields. Recent advances have focused on constraint-based causal discovery to solve this problem, yet in practice these methods often face challenges related to multiple testing and error propagation. These issues could be mitigated by a score-based method, which raises the question of whether a score-based greedy search method can handle the partially observed scenario. In this work, we propose the first score-based greedy search method for the identification of structure involving latent variables with identifiability guarantees. Specifically, we propose the Generalized N Factor Model and establish global consistency: the true structure, including latent variables, can be identified up to the Markov equivalence class using a score. We then design Latent variable Greedy Equivalence Search (LGES), a greedy search algorithm for this class of models with well-defined operators, which searches efficiently over the graph space to find the optimal structure. Our experiments on both synthetic and real-life data validate the effectiveness of our method.
ALM-MTA: Front-Door Causal Multi-Touch Attribution Method for Creator-Ecosystem Optimization
Yuguang Liu ⋅ Luyao Xia ⋅ Hu Liu ⋅ Zhangxi Yan ⋅ Jian Liang ⋅ Han Li ⋅ Kun Gai
Consumption-Drives-Production (CDP) on social platforms aims to deliver interpretable incentive signals for creator-ecosystem building and resource utilization improvement, which strongly relies on attribution. In large-scale and complex recommendation systems, the absence of accurate labels together with unobserved confounding renders backdoor adjustments alone insufficient for reliable attribution. To address these problems, we propose Adversarial Learning Mediator based Multi-Touch Attribution (ALM-MTA), an extensible causal framework that leverages front-door identification with an adversarially learned mediator: a proxy trained to distill outcome information to strengthen the causal pathway from treatment to outcome and eliminate shortcut leakage. Then, we introduce contrastive learning that conditions front-door marginalization on high-match consumption-upload pairs to ensure positivity in large treatment spaces. To assess causality from non-RCT logs, we also incorporate a non-personalized bucketed protocol, estimating grouped uplift and computing AUUC over treatment clusters. Finally, we evaluate ALM-MTA performance using a real-world recommendation system with 400 million DAU (daily active users) and 30 billion samples. ALM-MTA increased DAU by 0.04% and daily active creators by 0.6%, with unit exposure efficiency increased by 670%. On causal utility, ALM-MTA achieves higher grouped AUUC than the SOTA in every propensity bucket, with a maximum gain of 0.070. In terms of accuracy, ALM-MTA improves upload AUC by 40% compared to SOTA. These results demonstrate that front-door deconfounding with adversarial mediator learning provides accurate, personalized and operationally efficient attribution for creator ecosystem optimization.
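For reference, the front-door identification that the abstract relies on is usually stated as the standard front-door adjustment formula below. The symbols are generic notation chosen for this note ($t$ for treatment, $m$ for the mediator, $y$ for the outcome), not the paper's own, and how ALM-MTA estimates each term with its learned mediator is not specified here.

$$
P\bigl(y \mid do(t)\bigr) \;=\; \sum_{m} P(m \mid t) \sum_{t'} P(y \mid m, t')\, P(t')
$$

The outer sum marginalizes over the mediator given the treatment, and the inner sum averages the outcome model over the treatment distribution. The formula requires $P(m \mid t) > 0$ for the relevant pairs, which is presumably the positivity concern that the contrastive conditioning step targets.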
Dual-Branch Representations with Dynamic Gated Fusion and Triple-Granularity Alignment for Deep Multi-View Clustering
Wenyuan Kong ⋅ Zhibin Gu ⋅ Bing Li
Multi-view clustering seeks to exploit complementary information across different views to enhance clustering performance, where both semantic and structural information are crucial. However, existing approaches often bias toward one type of information while treating the other as auxiliary, overlooking that the reliability of these signals may vary across datasets and that semantic and structural cues can provide complementary and parallel guidance. As a result, such methods may face limitations in generalization and suboptimal clustering performance. To address these issues, we propose a novel method, Dual-branch Representations with dynamic gatEd fusion and triple-grAnularity alignMent (DREAM), for deep multi-view clustering. Specifically, DREAM disentangles semantic information via a Variational Autoencoder (VAE) branch, while simultaneously capturing structure-aware features through a Graph Convolutional Network (GCN) branch. The resulting representations are dynamically integrated using a gated fusion module that leverages structural cues as complementary guidance, adaptively balancing semantic and structural contributions to produce clustering-oriented latent embeddings. To further improve robustness and discriminability, we introduce a triple-granularity feature alignment mechanism that enforces consistency across views, within individual samples, and intra-cluster, thereby preserving semantic-structural coherence while enhancing inter-cluster separability. Extensive experiments on benchmark datasets demonstrate that DREAM significantly outperforms SOTA approaches, highlighting the effectiveness of disentangled dual-branch encoding, adaptive gated fusion, and triple-granularity feature alignment.
Permutation-Consistent Variational Encoding for Incomplete Multi-View Multi-Label Classification
Chengliang Liu ⋅ Bo Li ⋅ Bob Zhang ⋅ Xiaoling Luo ⋅ Yabo Liu ⋅ Jie Wen
Incomplete multi-view multi-label learning is fundamentally an information integration problem under simultaneous view and label incompleteness. We introduce Permutation-Consistent Variational Encoding framework (PCVE) with an information bottleneck strategy, which learns variational representations capable of aggregating shared semantics across views while remaining robust to incompleteness. PCVE formulates a principled objective that maximizes a variational evidence lower bound to retain task-relevant information, and introduces a permutation-consistent regularization to encourage distributional consistency among representations that encode the same target semantics from different views. This regularization acts as an information alignment mechanism that suppresses view-private redundancy and mitigates over-alignment, thereby improving both sufficiency and consistency of the learned representations. To address missing labels, PCVE further incorporates a masked multi-label learning objective that leverages available supervision while modeling label dependencies. Extensive experiments across diverse benchmarks and missing ratios demonstrate consistent gains over state-of-the-art methods in multi-label classification, while enabling reliable inference of missing views without explicit imputation. Analyses corroborate that the proposed information-theoretic formulation improves cross-view semantic cohesion and preserves discriminative capacity, underscoring the effectiveness and generality of PCVE for incomplete multi-view multi-label learning.
Temporal Slowness in Central Vision Drives Semantic Object Learning
Timothy Schaumlöffel ⋅ Arthur Aubret ⋅ Gemma Roig ⋅ Jochen Triesch
Humans acquire semantic object representations from egocentric visual streams with minimal supervision, but the underlying mechanisms remain unclear. Importantly, the visual system only processes the center of its field of view with high resolution and it learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and a state-of-the-art gaze prediction model. We extract image crops around predicted gaze locations to train a time-contrastive Self-Supervised Learning model. Our results show that exploiting temporal slowness when learning from central visual field experience improves the encoding of different facets of object semantics. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially in conjunction with eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience. Code is available at https://github.com/t9s9/central-vision-ssl.
Self-Supervised Learning from Structural Invariance
Yipeng Zhang ⋅ Hafez Ghaemi ⋅ Jungyoon Lee ⋅ Shahab Bakhtiari ⋅ Eilif B Muller ⋅ Laurent Charlin
Joint-embedding self-supervised learning (SSL), the key paradigm for unsupervised representation learning from visual data, learns from invariances between semantically-related data pairs. We study the one-to-many mapping problem in SSL, where each datum may be mapped to multiple valid targets. This arises when data pairs come from naturally occurring generative processes, e.g., successive video frames. We show that existing methods struggle to flexibly capture this conditional uncertainty. As a remedy, we introduce a latent variable to account for this uncertainty and derive a variational lower bound on the mutual information between paired embeddings. Our derivation yields a simple regularization term for standard SSL objectives. The resulting method, which we call AdaSSL, applies to both contrastive and distillation-based SSL objectives, and we empirically show its versatility in causal representation learning, fine-grained image understanding, and world modeling on videos.
Difficulty–Diversity Collaborative Filtering for Data-Efficient LLM Fine-Tuning
Long Hoang ⋅ Wenxuan Zhang ⋅ Wei Lu
The performance of fine-tuned language models is heavily influenced by the quality and quantity of their fine-tuning data. While scaling laws suggest that larger models benefit from more data during pretraining, the Less-is-More hypothesis highlights that downstream fine-tuning often requires only a small but high-quality dataset to effectively elicit a model’s pretrained knowledge. However, identifying such premium data, particularly in terms of difficulty and diversity, typically relies on human expertise, and existing methods offer limited guidance for automatic selection from large unannotated corpora. This work presents a novel quantitative framework that formalizes the interplay between question difficulty and diversity, and introduces *Difficulty–Diversity Collaborative Filtering* (DDCF): an automated approach that tailors data selection to the unique characteristics of each language model via collaborative filtering. By leveraging a small seed dataset to predict correctness across a large unannotated corpus, our method reduces the annotation cost by $100-200\times$, while maintaining downstream performance comparable to full-corpus fine-tuning.
Reverse Distillation: Consistently Scaling Protein Language Model Representations
Darius Catrina ⋅ Christian Bepler ⋅ Samuel Sledzieski ⋅ Rohit Singh
Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid-sized models often outperforming the largest in the family. We introduce Reverse Distillation, a principled framework that decomposes large PLM representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka-style structure: the first $k$ dimensions of a larger model's embedding are exactly the representation from the smaller model. This ensures that larger reverse-distilled models consistently outperform smaller ones. A motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly-shared protein features. Reverse distillation isolates these shared features and orthogonally extracts additional contributions from larger models, preventing interference between the two. On ProteinGym benchmarks, reverse-distilled ESM-2 variants outperform their respective baselines at the same embedding dimensionality, with the reverse-distilled 15 billion parameter model achieving the strongest performance. Our framework is generalizable to any model family where scaling challenges persist. Code and trained models are available at https://github.com/rohitsinghlab/plm_reverse_distillation.
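The nested structure described above can be illustrated with a small linear-algebra sketch: keep the small model's embedding as the first $k$ coordinates, then append the part of the large model's embedding that is orthogonal (across the dataset) to the span of the small model's features. This is a simplified assumption about how such a decomposition could be built, using plain Gram-Schmidt projection with no dimensionality reduction of the residual; the function names are hypothetical and the paper's actual procedure may differ.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def columns(matrix):
    """Transpose a list-of-rows matrix into a list of column vectors."""
    return [list(col) for col in zip(*matrix)]


def remove_projection(vector, basis):
    """Subtract from `vector` its projection onto an orthonormal `basis`."""
    residual = list(vector)
    for q in basis:
        coeff = dot(q, residual)
        residual = [r - coeff * qi for r, qi in zip(residual, q)]
    return residual


def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt; near-dependent vectors are dropped."""
    basis = []
    for v in vectors:
        residual = remove_projection(v, basis)
        norm = dot(residual, residual) ** 0.5
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def nested_embedding(small, large):
    """Return rows [small_i | residual_i], where residual feature columns are
    orthogonal to every small-model feature column over the dataset."""
    basis = orthonormal_basis(columns(small))
    residual_cols = [remove_projection(col, basis) for col in columns(large)]
    residual_rows = [list(row) for row in zip(*residual_cols)]
    return [list(s) + r for s, r in zip(small, residual_rows)]


small = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]            # n=4 samples, k=2
large = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
for row in nested_embedding(small, large):
    print(row)
```

By construction the first two coordinates of every output row reproduce the small embedding exactly, so any predictor fitted on the small model's features remains expressible on the nested representation.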
Beyond Instance-Level Alignment: Dual-Level Optimal Transport for Audio-Text Retrieval
Wenqi Guo ⋅ Shikui Tu ⋅ Lei Xu
Cross-modal matching tasks have achieved significant progress, yet remain limited by mini-batch subsampling and scarce labelled data. Existing objectives, such as contrastive losses, focus solely on instance-level alignment and implicitly assume that all feature dimensions contribute equally. Under small batches, this assumption amplifies noise, making alignment signals unstable and biased. We propose DART (Dual-level Alignment via Robust Transport), a framework that augments instance-level alignment with feature-level regularization based on the Unbalanced Wasserstein Distance (UWD). DART constructs reliability-weighted marginals that adaptively reweight channels according to their cross-modal consistency and variance statistics, highlighting stable and informative dimensions while down-weighting noisy or modality-specific ones. From a theoretical perspective, we establish concentration bounds showing that instance-level objectives scale with the maximum distance across presumed aligned pairs, while feature-level objectives are governed by the Frobenius norm of the transport plan. By suppressing unmatched mass and sparsifying the transport plan, DART reduces the effective transport diameter and tightens the bound, yielding greater robustness under small batches. Empirically, DART achieves state-of-the-art retrieval performance on three audio-text benchmarks, with particularly strong gains under scarce labels and small batch sizes.
KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
Xinping Zhao ⋅ Xinshuo Hu ⋅ Zifei Shan ⋅ Shouzheng Huang ⋅ Yao Zhou ⋅ Xin Zhang ⋅ Zetian Sun ⋅ zhenyu liu ⋅ dongfang li ⋅ Xinyuan Wei ⋅ Youcheng Pan ⋅ Yang Xiang ⋅ Meishan Zhang ⋅ Haofen Wang ⋅ Jun Yu ⋅ Baotian Hu ⋅ Min Zhang
Recent advancements in Large Language Model (LLM)-based text embedding models primarily focus on data scaling or synthesis, with limited exploration of training techniques and data quality, thereby constraining performance. In this work, we propose KaLM-Embedding-V2 from the Lychee-KaLM team, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs by superior training techniques and high-quality data. For model architecture, we implement the models on a 0.5B compact size with simple mean-pooling to produce fixed-length embeddings and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with fine-grained soft signals, integrated with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation, to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters. The code, data, and models are available at https://kalm-embedding.github.io/.
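The "simple mean-pooling" step mentioned above is commonly implemented as an average over non-padding token vectors. The sketch below shows that generic operation on plain Python lists; the function name and the 0/1 attention-mask convention are assumptions for illustration, and this is not taken from the KaLM-Embedding-V2 code.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors whose mask entry is 1, ignoring padding positions.

    token_embeddings: list of equal-length float vectors, one per token.
    attention_mask: list of 0/1 flags aligned with token_embeddings.
    """
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / kept for total in totals]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last token is padding
mask = [1, 1, 0]
print(mean_pool(tokens, mask))
```

The output length depends only on the hidden dimension, which is how a variable-length token sequence becomes a fixed-length embedding.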
LLM DNA: Tracing Model Evolution via Functional Representations
Zhaomin Wu ⋅ Haodong Zhao ⋅ Ziyang Wang ⋅ Jizhou Guo ⋅ Qian Wang ⋅ Bingsheng He
The explosive growth of large language models (LLMs) has created a vast but opaque landscape: millions of models exist, yet their evolutionary relationships through fine-tuning, distillation, or adaptation are often undocumented or unclear, complicating LLM management. Existing methods are limited by task specificity, fixed model sets, or strict assumptions about tokenizers or architectures. Inspired by biological DNA, we address these limitations by mathematically defining LLM DNA as a low-dimensional, bi-Lipschitz representation of functional behavior. We prove that LLM DNA satisfies inheritance and genetic determinism and establish its existence. Building on this theory, we derive a general, scalable, training-free pipeline for DNA extraction. In experiments across 305 LLMs, DNA aligns with prior studies on limited subsets and achieves superior or competitive performance on various tasks. Beyond these tasks, DNA comparisons uncover previously undocumented relationships among LLMs. We further construct the evolutionary tree of LLMs using phylogenetic algorithms, which align with shifts from encoder-decoder to decoder-only architectures, reflect temporal progression, and reveal distinct evolutionary speeds across LLM families.
An Efficient SE(p)-Invariant Transport Metric Driven by Polar Transport Discrepancy-based Representation
Junyi Lin ⋅ Dunyao Xue ⋅ Jun Yu ⋅ Hongteng Xu ⋅ Cheng Meng
We introduce SEINT, a novel Special Euclidean group-Invariant (SE($p$)) metric for comparing probability distributions on $p$-dimensional measured Banach spaces. Existing SE($p$)-invariant alignment methods often face high computational costs or lack metric guarantees. To overcome these limitations, we develop a polar transport discrepancy combined with distance convolution to extract SE($p$)-invariant representations. These representations are then used to compute the alignment between two distributions via optimal transport. Theoretically, we prove that SEINT is a well-defined metric on the space of isometry classes of normed vector spaces. Beyond its inherent SE($p$)-invariance, SEINT also supports cross-space distribution comparison. Computationally, SEINT aligns two samples of size $n$ with a complexity of just $\mathcal{O}(n\log n)$ to $\mathcal{O}(n^2)$. Extensive experiments validate its advantages: as a robust metric, it outperforms or matches existing SE($p$)-invariant methods in classification and cross-space tasks under isometries. As a regularizer, it greatly enhances molecular generation performance across both pre-training and fine-tuning tasks, achieving state-of-the-art (SOTA) results on key benchmarks.
Soft Equivariance Regularization for Invariant Self-Supervised Learning
Joohyung Lee ⋅ Changhun Kim ⋅ Hyunsu Kim ⋅ Kwanhyung Lee ⋅ Juho Lee
Self-supervised learning (SSL) typically learns representations invariant to semantic-preserving augmentations (e.g., random crops and photometric jitter). While effective for recognition, enforcing strong invariance can suppress transformation-dependent structure that is useful for robustness to geometric perturbations and spatially sensitive transfer. A growing body of work, therefore, augments invariance-based SSL with equivariance objectives, but these objectives are often imposed on the same **final** (typically spatially-collapsed) representation. We empirically observe a trade-off in this coupled setting: pushing equivariance regularization toward deeper layers improves equivariance scores but degrades ImageNet-1k linear evaluation, motivating a layer-decoupled design. We therefore propose **Soft Equivariance Regularization (SER)**, a plug-in regularizer that decouples where invariance and equivariance are enforced: we keep the base SSL objective unchanged on the final embedding, while softly encouraging equivariance on an *intermediate spatial token map* via analytically specified group actions $\rho_g$ (e.g., $90^{\circ}$ rotations, flips, and scaling) applied directly in feature space. SER neither learns nor predicts per-sample transformation codes or labels, requires no auxiliary transformation-prediction head, and adds only **1.008$\times$** training FLOPs. On ImageNet-1k ViT-S/16 pretraining, SER improves MoCo-v3 by **+0.84** Top-1 in linear evaluation under a strictly matched 2-view setting and consistently improves DINO and Barlow Twins; under matched view counts, SER achieves the best ImageNet-1k linear-eval Top-1 among the compared invariance+equivariance add-ons. SER further improves ImageNet-C/P by **+1.11/+1.22** Top-1 and frozen-backbone COCO detection by **+1.7** mAP.
Finally, applying the same **layer-decoupling** recipe to existing invariance+equivariance baselines (e.g., EquiMod and AugSelf) improves their accuracy, suggesting layer decoupling as a general design principle for combining invariance and equivariance. Code is available at https://github.com/aitrics-chris/SER.
Frequency-Domain Better than Time-Domain for Causal Structure Recovery in Dynamical Systems on Networks
Mohammed Tuhin Rana ⋅ Mishfad Shaikh Veedu ⋅ James Melbourne ⋅ Murti Salapaka
Learning causal effects from data is a fundamental and well-studied problem across science, especially when the cause-effect relationship is static in nature. However, causal effects are less explored when there are dynamical dependencies, i.e., when dependencies exist between entities across time. In general, it is not possible to reconstruct the causal graph from data alone. The conventional static causal structure recovery algorithms employ tests such as the Fisher-z test and the chi-square test to assess the conditional independence (CI) of data, which forms the basis for recovering Markov Equivalent Graphs (MEGs) wherein causal structure can be recovered partially. For data that are dynamically related, multivariate least-squares estimation, based on Wiener Filters (WFs) relying on second-order statistics for estimating a data stream from other streams, provides a means of recovering influence structures of the directed network underlying the data. WF-based projections can be determined in the time domain or in the frequency domain; the question this article sets out to answer is which is better. We obtain concentration bounds on the accuracy of the WF estimation in both time and frequency-based approaches. Exploiting the computation speed of the Fast Fourier Transform (FFT), we establish that the frequency domain provides distinct advantages. Moreover, frequency-domain projections involve complex numbers; we establish that the phase properties of the resulting estimates can be effectively leveraged for better recovery of the MEG in a large class of networks; the time domain has no analogue of phase. Thus, we report that the "Wiener-Phase" algorithm provides the best accuracy as well as computational advantages. We validate the theoretical analysis with numerical results. Performance comparisons with state-of-the-art algorithms are also provided.
Further, the proposed algorithms are validated on a real field dataset known as the "river-runoff" dataset collected from the online repository of CauseMe, and on measurement data from transistor based circuits.
Behavioral Embeddings of Programs: A Quasi-Dynamic Approach for Optimization Prediction
Haolin Pan ⋅ Dong Jinyuan ⋅ Hongbin Zhang ⋅ Hongyu Lin ⋅ Mingjie Xing ⋅ Yanjun WU
Learning effective numerical representations, or embeddings, of programs is a fundamental prerequisite for applying machine learning to automate and enhance compiler optimization. Prevailing paradigms, however, present a dilemma. Static representations, derived from source code or intermediate representation (IR), are efficient and deterministic but offer limited insight into how a program will behave or evolve under complex code transformations. Conversely, dynamic representations, which rely on runtime profiling, provide profound insights into performance bottlenecks but are often impractical for large-scale tasks due to prohibitive overhead and inherent non-determinism. This paper transcends this trade-off by proposing a novel quasi-dynamic framework for program representation. The core insight is to model a program's optimization sensitivity. We introduce the Program Behavior Spectrum, a new representation generated by probing a program's IR with a diverse set of optimization sequences and quantifying the resulting changes in its static features. To effectively encode this high-dimensional, continuous spectrum, we pioneer a compositional learning approach. Product Quantization is employed to discretize the continuous reaction vectors into structured, compositional sub-words. Subsequently, a multi-task Transformer model, termed PQ-BERT, is pre-trained to learn the deep contextual grammar of these behavioral codes. Comprehensive experiments on two representative compiler optimization tasks---Best Pass Prediction and -Oz Benefit Prediction---demonstrate that our method outperforms strong static baselines. Our code is publicly available at https://github.com/Panhaolin2001/PREP/.
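The discretization step can be pictured with a minimal product-quantization encoder. This is a generic sketch, not the authors' implementation: the codebooks here are random stand-ins (in practice they would typically be learned, for example with k-means per subspace), and `pq_encode`, the number of subspaces, and the centroid count are all hypothetical choices.

```python
import numpy as np

def pq_encode(vector, codebooks):
    """Map one continuous vector to a tuple of sub-word indices.

    codebooks has shape (M, K, d_sub): M subspaces with K centroids each.
    """
    num_subspaces, _, sub_dim = codebooks.shape
    sub_vectors = vector.reshape(num_subspaces, sub_dim)
    # Squared distance from each sub-vector to every centroid of its own codebook.
    dists = ((codebooks - sub_vectors[:, None, :]) ** 2).sum(axis=-1)
    return tuple(int(i) for i in dists.argmin(axis=1))

rng = np.random.default_rng(0)
codebooks = rng.normal(size=(4, 8, 2))  # hypothetical: 4 subspaces, 8 centroids, 2 dims each
# Build a vector from centroid 3 of every subspace, so each sub-word should be 3.
reaction = np.concatenate([codebooks[m, 3] for m in range(4)])
print(pq_encode(reaction, codebooks))
```

Each index in the returned tuple plays the role of one compositional sub-word, so a sequence of reaction vectors becomes a sequence of discrete tokens that a Transformer can consume.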
Mitigating Noise Shift in Denoising Generative Models with Noise Awareness Guidance
Jincheng Zhong ⋅ Boyuan Jiang ⋅ Xin Tao ⋅ Pengfei Wan ⋅ Kun Gai ⋅ Mingsheng Long
Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models. Code is publicly available at https://github.com/KlingAIResearch/noise-awareness-guidance.
On the Alignment Between Supervised and Self-Supervised Contrastive Learning
Achleshwar Luthra ⋅ Priyadarsi Mishra ⋅ Tomer Galanti
Self-supervised contrastive learning (CL) has achieved remarkable empirical success, often producing representations that rival supervised pre-training on downstream tasks. Recent theory explains this by showing that the CL loss closely approximates a supervised surrogate, Negatives-Only Supervised Contrastive Learning (NSCL), as the number of classes grows. Yet this loss-level similarity leaves an open question: *Do CL and NSCL also remain aligned at the representation level throughout training, not just in their objectives?* We address this by analyzing the representation alignment of CL and NSCL models trained under shared randomness (same initialization, batches, and augmentations). First, we show that their induced representations remain similar: specifically, we prove that the similarity matrices of CL and NSCL stay close under realistic conditions. Our bounds provide high-probability guarantees on alignment metrics such as centered kernel alignment (CKA) and representational similarity analysis (RSA), and they clarify how alignment improves with more classes and higher temperatures, and how it depends on batch size. In contrast, we demonstrate that parameter-space coupling is inherently unstable: divergence between CL and NSCL weights can grow exponentially with training time. Finally, we validate these predictions empirically, showing that CL–NSCL alignment strengthens with scale and temperature, and that NSCL tracks CL more closely than other supervised objectives. This positions NSCL as a principled bridge between self-supervised and supervised learning.
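For readers unfamiliar with the alignment metric, a minimal sketch of linear CKA is given below. It uses the standard linear-kernel formulation; whether the paper's bounds are stated for this exact variant is not specified in the abstract, and the matrices `Z_cl` and `Z_rot` are synthetic placeholders rather than real CL or NSCL representations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices whose rows are the same samples."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
Z_cl = rng.normal(size=(64, 16))                 # placeholder representations
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
Z_rot = Z_cl @ Q                                 # same geometry, rotated features
print(linear_cka(Z_cl, Z_rot))
```

Because linear CKA is invariant to orthogonal transformations of the feature space, the two placeholder matrices score 1 up to floating-point error even though their coordinates differ. This is why representation-level alignment can hold while parameter-space distances grow.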
Disentangled representation learning through unsupervised symmetry group discovery
Barthélémy Dang-Nhu ⋅ Louis Annabi ⋅ Sylvain ARGENTIERI
Symmetry-based disentangled representation learning leverages the group structure of environment transformations to uncover the latent factors of variation. Prior approaches to symmetry-based disentanglement have required strong prior knowledge of the symmetry group's structure, or restrictive assumptions about the subgroup properties. In this work, we remove these constraints by proposing a method whereby an embodied agent autonomously discovers the group structure of its action space through unsupervised interaction with the environment. We prove the identifiability of the true action group decomposition under minimal assumptions, and derive two algorithms: one for discovering the group decomposition from interaction data, and another for learning Linear Symmetry-Based Disentangled (LSBD) representations without assuming specific subgroup properties. Our method is validated on three environments exhibiting different group decompositions, where it outperforms existing LSBD approaches.
PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning
Daiki Yoshikawa ⋅ Takashi Matsubara
Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., *dog* $\preceq$ *mammal* $\preceq$ *animal*) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ *dog*, *car*). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose *PHyCLIP*, which employs an $\ell_1$-*P*roduct metric on a Cartesian product of *Hy*perbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
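To make the $\ell_1$-product metric concrete, the sketch below sums per-factor hyperbolic distances. It assumes each factor is a Poincaré ball of curvature -1, which the abstract does not specify, and the function names are illustrative rather than taken from the paper.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # Closed form for curvature -1; both points must have norm strictly below 1.
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

def l1_product_distance(x_factors, y_factors):
    """l1-product metric: the sum of the hyperbolic distances in each factor."""
    return sum(poincare_distance(u, v) for u, v in zip(x_factors, y_factors))

# Two embeddings with two hyperbolic factors each; they differ only in the second factor.
x = [(0.1, 0.2), (0.0, 0.0)]
y = [(0.1, 0.2), (0.5, 0.0)]
print(l1_product_distance(x, y))
```

Since the factors contribute additively, a difference confined to one factor (one concept family) leaves the contribution of every other factor at zero, which is the intuition behind the Boolean-algebra analogy.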
VIRTUE: Visual-Interactive Text-Image Universal Embedder
Wei-Yao Wang ⋅ Kazuya Tateishi ⋅ Qiyu Wu ⋅ Shusuke Takahashi ⋅ Yuki Mitsufuji
Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions would not only unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%–8.5%) and five visual-interactive SCaR (15.2%–20.3%) tasks. The code, models, and benchmarks are available at https://github.com/sony/virtue.
There Was Never a Bottleneck in Concept Bottleneck Models
Antonio Almudévar ⋅ José Miguel Hernández Lobato ⋅ Alfonso Ortega
Deep learning representations are often difficult to interpret, which can hinder their deployment in sensitive applications. Concept Bottleneck Models (CBMs) have emerged as a promising approach to mitigate this issue by learning representations that support target task performance while ensuring that each component predicts a concrete concept from a predefined set. In this work, we argue that CBMs do not impose a true bottleneck: the fact that a component can predict a concept does not guarantee that it encodes only information about that concept. This shortcoming raises concerns regarding interpretability and the validity of intervention procedures. To overcome this limitation, we propose Minimal Concept Bottleneck Models (MCBMs), which incorporate an Information Bottleneck (IB) objective to constrain each representation component to retain only the information relevant to its corresponding concept. This IB is implemented via a variational regularization term added to the training loss. As a result, MCBMs yield more interpretable representations, support principled concept-level interventions, and remain consistent with probability-theoretic foundations.
Towards Understanding the Shape of Representations in Protein Language Models
Kosio Beshkov ⋅ Anders Malthe-Sorenssen
While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations, is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square-root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other. We analyze different types of proteins from the SCOP dataset and show that the Fréchet radius and effective dimension of the SRV shape space follow a non-linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between amino acids, but start to degrade for larger context lengths. The most structurally faithful encoding tends to occur close to, but before, the last layer of the models, indicating that training a folding model on top of these layers might lead to improved folding performance.
Verification of the Implicit World Model in a Generative Model via Adversarial Sequences
András Balogh ⋅ Márk Jelasity
Generative sequence models are typically trained on sample sequences from natural or formal languages. It is a crucial question whether—or to what extent—sample-based training is able to capture the true structure of these languages, often referred to as the "world model". Theoretical results indicate that we can hope for soundness at best, that is, generating valid sequences, but not necessarily all of them. However, it is still important to have practical tools that are able to verify whether a given sequence model is sound. In this study, we focus on chess, as it is a domain that provides enough complexity while having a simple rule-based world model. We propose adversarial sequence generation for verifying the soundness of the sequence model. Our adversaries generate valid sequences so as to force the sequence model to generate an invalid next move prediction. Apart from the falsification of soundness, this method is also suitable for a more fine-grained analysis of the failure modes and the effects of different choices during training. To demonstrate this, we propose a number of methods for adversarial sequence generation and evaluate the approach on a large set of chess models. We train models on random as well as high-quality chess games, using several training recipes. We find that none of the models are sound, but some training techniques and dataset choices are able to improve soundness remarkably. We also investigate the potential application of board state probes in both our training and attack methods. Our findings indicate that the extracted board states have no causal role in next token prediction in most of the models.
Uncertainty-driven Embedding Convolution
Sungjun Lim ⋅ Kangjun Noh ⋅ Youngjun Choi ⋅ Heeyoung Lee ⋅ Kyungwoo Song
Text embeddings are essential components in modern NLP pipelines. Although numerous embedding models have been proposed, no single model consistently dominates across domains and tasks. This variability motivates the use of ensemble techniques to combine complementary strengths. However, most existing ensemble methods operate on deterministic embeddings and fail to account for model-specific uncertainty, limiting their robustness and reliability in downstream applications. To address these limitations, we propose Uncertainty-driven Embedding Convolution (UEC). UEC first transforms deterministic embeddings into probabilistic ones in a post-hoc manner. It then computes adaptive ensemble coefficients based on embedding uncertainty, derived from a principled surrogate-loss formulation. Additionally, UEC employs an uncertainty-aware similarity function that directly incorporates uncertainty into the similarity scoring, providing a theoretically grounded and efficient surrogate to distributional distances. Extensive experiments on diverse benchmarks demonstrate that UEC consistently improves both performance and robustness by leveraging principled uncertainty modeling.
Anatomy-aware Representation Learning for Medical Ultrasound
Seok-Hwan Oh ⋅ Myeong-Gee Kim ⋅ Guil Jung ⋅ hyeonjik lee ⋅ Young-Min Kim ⋅ Sang-yun Kim ⋅ Hyuksool Kwon ⋅ Hyeon-min Bae
Diagnostic accuracy of ultrasound imaging is limited by qualitative variability and its reliance on the expertise of medical professionals. Such challenges increase demand for computer-aided diagnostic systems that enhance diagnostic accuracy and efficiency. However, the unique texture and structural attributes of ultrasound images, and the scarcity of large-scale ultrasound datasets, hinder the effective application of conventional machine learning methodologies. To address these challenges, we propose Anatomy-aware Representation Learning (ARL), a novel self-supervised representation learning framework specifically designed for medical ultrasound imaging. ARL incorporates an anatomy-adaptive Vision Transformer (A-ViT). The A-ViT is parameterized, using the proposed large-scale medical ultrasound dataset, to provide anatomy-aware feature representations. Through extensive experiments across various ultrasound-based diagnostic tasks, including breast and thyroid cancer, cardiac view classification, and gallbladder tumor and COVID-19 identification, we demonstrate that ARL significantly outperforms existing self-supervised learning baselines. The experiments demonstrate the potential of ARL in advancing medical ultrasound diagnostics by providing anatomy-specific feature representation.
In recent years, Contrastive Learning (CL) has garnered significant attention due to its efficacy across various domains, spanning visual and textual modalities. A fundamental aspect of CL is aligning the representations of anchor instances with relevant positive samples while simultaneously separating them from negative ones. Prior studies have extensively explored diverse strategies for generating and sampling contrastive (i.e., positive/negative) pairs. Despite the empirical success, the theoretical understanding of the CL approach remains under-explored, leaving questions such as the rationale behind contrastive-pair sampling and its contributions to the model performance unclear. This paper addresses this gap by providing a comprehensive theoretical analysis from the angle of optimality conditions and introducing the SnaPhArd Contrast Learning (SPACL). Specifically, SPACL prioritizes hard positive and hard negative samples when constructing contrastive pairs and computing the contrastive loss, rather than treating all samples equally. Experimental results across two downstream tasks demonstrate that SPACL consistently outperforms or competes favorably with state-of-the-art methods, showcasing its robustness and efficacy. A comprehensive ablation study further examines the effectiveness of SPACL's individual components to verify the theoretical findings.
Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization
Omri Uzan ⋅ Asaf Yehudai ⋅ Roi Pony ⋅ Eyal Shnarch ⋅ Ariel Gera
Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual tokens directly to image patches and achieving state-of-the-art performance on challenging benchmarks. Recent models relying on this paradigm have massively scaled the dimensionality of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model’s representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever’s query embedding using guidance from a complementary retriever’s scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows ColPali-based models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval
EditScore: Unlocking Online RL for Image Editing via High-Fidelity Reward Modeling
Xin Luo ⋅ Jiahao Wang ⋅ Chenyuan Wu ⋅ Shitao Xiao ⋅ Xiyan Jiang ⋅ Defu Lian ⋅ Jiajun Zhang ⋅ Dong Liu ⋅ Zheng Liu
Instruction-guided image editing has achieved remarkable progress, yet current models still face challenges with complex instructions and often require multiple samples to produce a desired result. Reinforcement Learning (RL) offers a promising solution, but its adoption in image editing has been severely hindered by the lack of a high-fidelity, efficient reward signal. In this work, we present a comprehensive methodology to overcome this barrier, centered on the development of a state-of-the-art, specialized reward model. We first introduce $\textbf{EditReward-Bench}$, a comprehensive benchmark to systematically evaluate reward models on editing quality. Guided by this benchmark, we develop $\textbf{EditScore}$, an efficient model to evaluate the quality of instruction-guided editing. Through meticulous data curation and filtering, EditScore effectively matches the performance of leading proprietary VLMs. Furthermore, coupled with an effective self-ensemble strategy tailored for the generative nature of EditScore, our largest variant even surpasses GPT-5 in the benchmark. We then demonstrate that a high-fidelity reward model is the key to unlocking online RL for image editing. Our experiments show that, while even the largest open-source VLMs fail to provide an effective learning signal, EditScore enables efficient and robust policy optimization. Applying our framework to a strong base model, OmniGen2, results in a final model that shows a substantial and consistent performance uplift. Overall, this work provides the first systematic path from benchmarking to reward modeling to RL training in image editing, showing that a high-fidelity, domain-specialized reward model is the key to unlocking the full potential of RL in this domain. Our code, models, and benchmark will be released publicly.
Covariate-Guided Clusterwise Linear Regression for Generalization to Unseen Data
Dohyun Bu ⋅ Hyunho Kim ⋅ JONGSEOK LEE
In many tabular regression tasks, the relationships between covariates and response can often be approximated as linear only within localized regions of the input space; a single global linear model therefore fails to capture these local relationships. Conventional Clusterwise Linear Regression (CLR) mitigates this issue by learning $K$ local regressors. However, existing algorithms either (i) optimize latent binary indicators, providing no explicit rule for assigning an $\textit{unseen}$ covariate vector to a cluster at test time, or (ii) rely on heuristic mixture-of-experts approaches that lack convergence guarantees. To address these limitations, we propose $\textit{covariate-guided}$ CLR, an end-to-end framework that jointly learns an assignment function and $K$ linear regressors within a single gradient-based optimization loop. During training, a proxy network iteratively predicts coefficient vectors for inputs, and hard vector quantization assigns samples to their nearest codebook regressors. This alternating minimization procedure yields monotone descent of the empirical risk, converges under mild assumptions, and enjoys a PAC-style excess-risk bound. By treating the covariate data from all clusters as a single concatenated design matrix, we derive an $F$-test statistic from a nested linear model, quantitatively characterizing the effective model complexity. As $K$ varies, our method spans the spectrum from a single global linear model to instance-wise fits. Experimental results show that our method exactly reconstructs synthetic piecewise-linear surfaces, achieves accuracy comparable to strong black-box models on standard tabular benchmarks, and consistently outperforms existing CLR and mixture-of-experts approaches.
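The hard vector-quantization step can be sketched as follows. This is only an illustration of nearest-codebook assignment under simplifying assumptions: the proxy network is replaced by a hand-written array `proxy_coefs`, the regressors have no intercept, and the function name `assign_and_predict` is hypothetical. The training loop and the alternating minimization are not shown.

```python
import numpy as np

def assign_and_predict(X, proxy_coefs, codebook):
    """Hard-assign each sample to its nearest codebook regressor, then predict.

    X:           (n, d) covariates
    proxy_coefs: (n, d) per-sample coefficient vectors (stand-in for the proxy network output)
    codebook:    (K, d) coefficient vectors of the K linear regressors
    """
    # Squared Euclidean distance between every predicted vector and every codebook entry.
    dists = ((proxy_coefs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    clusters = dists.argmin(axis=1)
    # Each sample is scored by the regressor of its assigned cluster.
    predictions = (X * codebook[clusters]).sum(axis=1)
    return clusters, predictions

codebook = np.array([[1.0, 0.0], [0.0, 2.0]])
X = np.array([[3.0, 4.0], [5.0, 6.0]])
proxy_coefs = np.array([[0.9, 0.1], [0.2, 1.7]])
clusters, predictions = assign_and_predict(X, proxy_coefs, codebook)
print(clusters, predictions)
```

Because the assignment depends only on the covariates (through the proxy output), the same rule applies unchanged to an unseen covariate vector at test time, which is the gap in latent-indicator CLR that the abstract points out.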
The effect of feature resolution on embedding dimension
Louise Beyers ⋅ Ruan van der Merwe
High-dimensional data can be compressed into lower-dimensional embeddings while retaining a relatively large amount of relevant information, a phenomenon which, despite its widespread use, we struggle to fully explain. In this post, we use a common property of datasets - a limit on the number of features per data point - to show how a slight uniform dependence between features can be exploited to reduce the required dimensions by at least a third, while sacrificing no information about the features. To do so, we introduce the concepts of dataset resolution and feature composition of a dataset, and analyse how a set of orderings of the dataset affects the types of partitions we can create of the dataset.
Enhanced Continual Learning of Vision-Language Models with Model Fusion
Haoyuan Gao ⋅ Zicong Zhang ⋅ Yuqi Wei ⋅ Linglan Zhao ⋅ Guilin Li ⋅ Yexin Li ⋅ Bo Wang ⋅ Linghe Kong ⋅ Weiran Huang
Vision-Language Models (VLMs) represent a significant breakthrough in artificial intelligence by integrating visual and textual modalities to achieve impressive zero-shot capabilities. However, VLMs are susceptible to catastrophic forgetting when sequentially fine-tuned on multiple downstream tasks. Existing continual learning methods for VLMs face various limitations, often relying on additional reference datasets, compromising zero-shot performance, or being restricted to parameter-efficient fine-tuning scenarios. In this paper, we propose a novel Continual Decoupling-Unifying (ConDU) approach that pioneers the use of model fusion for continual learning in VLMs. Specifically, ConDU maintains a unified model along with task triggers and prototype sets, employing an iterative process of decoupling task experts for previous tasks and unifying them with the task expert for the newly learned task. Additionally, we introduce an inference strategy for zero-shot scenarios by aggregating predictions from multiple decoupled task experts. Extensive experiments on the MTIL benchmark show that ConDU achieves up to a 2% improvement in average performance across all seen tasks compared to state-of-the-art baselines, while also enhancing zero-shot capabilities relative to the original VLM. Our code is available at https://github.com/zhangzicong518/ConDU.
Command-V: Training-Free Representation Finetuning Transfer
Barry Wang ⋅ Avi Schwarzschild ⋅ Alexander Robey ⋅ Ali Payani ⋅ Charles Fleming ⋅ Mingjie Sun ⋅ Daphne Ippolito
Retrofitting large language models (LLMs) with new behaviors typically requires full finetuning or distillation, costly steps that must be repeated for every architecture. In this work, we introduce ⌘V (Command-V), a backpropagation-free behavior transfer method that copies an existing residual representation adapter from a donor model and pastes its effect into an architecturally different recipient model. ⌘V profiles layer activations on a small prompt set, derives linear converters between corresponding layers, and applies the donor intervention in the recipient’s activation space. This process does not require access to the original training data and needs minimal compute. In three case studies (safety-refusal enhancement, jailbreak facilitation, and automatic chain-of-thought reasoning), ⌘V matches the performance of direct finetuning while using orders of magnitude fewer resources.
Merge before Forget: A Single LoRA Continual Learning via Continual Merging
Fuli Qiao ⋅ Mehrdad Mahdavi
Parameter-efficient continual learning has emerged as a promising approach for large language models (LLMs) to mitigate catastrophic forgetting while enabling adaptation to new tasks. Current Low-Rank Adaptation (LoRA) continual learning techniques often retain and freeze previously learned LoRAs or generate data representations to overcome forgetting, typically utilizing these to help new LoRAs learn new tasks. However, these methods not only ignore the growth of computational memory with the number of tasks and the limits of storage space but also suffer from potential task interference due to the lack of effective LoRA merging mechanisms. In this paper, we propose a novel continual learning method that orthogonally initializes and sequentially merges LoRA updates into a single unified LoRA. Our method leverages orthogonal basis extraction from previously learned LoRAs to initialize the learning of new tasks, and further exploits the intrinsic asymmetry property of LoRA components by using a time-aware scaling mechanism to balance new and old knowledge during continual merging. Our approach maintains constant memory complexity with respect to the number of tasks, minimizes interference between past and new tasks via orthogonal basis initialization, and improves performance over asymmetric LoRA merging via adaptive scaling. We provide theoretical analysis to justify our design and conduct extensive experiments across diverse continual learning benchmarks using various LLMs, demonstrating the effectiveness and efficiency of our method.
Elastic Optimal Transport: Theory, Application, and Empirical Evaluation
Pei Yang ⋅ Yuhang Zhuang ⋅ Qi Tan
Classical optimal transport formulations, such as Kantorovich's optimal transport and partial optimal transport, can be too restrictive in applications due to their full-mass or fixed-mass preservation constraints. To remedy this limitation, we propose elastic optimal transport (ELOT), which is distinct from classical optimal transport in its ability to preserve mass adaptively. It aims to answer the question of how to transport probability mass adaptively between probability distributions, which is a fundamental topic in various areas of artificial intelligence. The strength of elastic optimal transport is its capability to transport an adaptive amount of mass in light of the geometric structure of the problem itself. As an application example in machine learning, we apply elastic optimal transport to both unsupervised domain adaptation and partial domain adaptation tasks. It adaptively transports mass from the source domain to the target domain by taking domain shift into consideration and respecting the ubiquity of noise or outliers in the data, in order to improve generalization performance. The experimental results on the benchmarks show that ELOT significantly outperforms the state-of-the-art methods. As a powerful distribution matching tool, elastic optimal transport might be of interest to broad areas such as artificial intelligence, healthcare, physics, operations research, and urban science. The source code is available in the supplementary material.
Forget Forgetting: Continual Learning in a World of Abundant Memory
Dongkyu Cho ⋅ Taesup Moon ⋅ Rumi Chunara ⋅ Kyunghyun Cho ⋅ Sungmin Cha
Continual learning (CL) has traditionally focused on minimizing exemplar memory, a constraint often misaligned with modern systems where GPU time, not storage, is the primary bottleneck. This paper challenges this paradigm by investigating a more realistic regime: one where memory is abundant enough to mitigate forgetting, but full retraining from scratch remains prohibitively expensive. In this practical "middle ground", we find that the core challenge shifts from stability to plasticity, as models become biased toward prior tasks and struggle to learn new ones. Conversely, improved stability allows simple replay baselines to outperform the state-of-the-art methods at a fraction of the GPU cost. To address this newly surfaced trade-off, we propose Weight Space Consolidation, a lightweight method that combines (1) rank-based parameter resets to restore plasticity with (2) weight averaging to enhance stability. Validated on both class-incremental learning with image classifiers and continual instruction tuning with large language models, our approach outperforms strong baselines while matching the low computational cost of replay, offering a scalable alternative to expensive full-retraining. These findings challenge long-standing CL assumptions and establish a new, cost-efficient baseline for real-world CL systems where exemplar memory is no longer the limiting factor.
Beyond Text-Only: Towards Multimodal Table Retrieval in Open-World
Da Li ⋅ Keping Bi ⋅ Jiafeng Guo ⋅ Wei Yuan ⋅ Fan Yang ⋅ Tingting Gao ⋅ Xueqi Cheng
Open-domain table retrieval aims to retrieve semantically relevant structured tables from a large-scale corpus in response to natural language queries. Unlike unstructured text, tables store information not only through their textual or numerical content but also through their structural properties, including hierarchical relationships between headers and cells, as well as complex spatial arrangements within the table layout. Existing methods predominantly treat table retrieval as a variant of text retrieval: they typically flatten tables into linear text sequences through row-wise or column-wise serialization, inadvertently discarding structural information, and therefore struggle to preserve the rich structural semantics of diverse table formats. The problem becomes particularly acute when processing complex table layouts containing merged cells or irregular alignments, ultimately compromising retrieval performance. Moreover, existing methods struggle to handle embedded images within table cells. Notably, visual representations inherently preserve both structural and content information while being format-agnostic. This insight motivates our exploration of image-based table retrieval, as it can naturally overcome the challenges faced by existing methods. In this paper, we introduce TaR-ViR (Table Retrieval via Visual Representations), a new benchmark that reformulates table retrieval as a multimodal task by treating tables as images. Experiments on TaR-ViR show that this paradigm shift achieves more effective and efficient retrieval performance. Crucially, it eliminates the need for error-prone text conversion, enabling scalable collection and utilization of open-world tables. Our data are available at https://github.com/Trustworthy-Information-Access/Tab-ViR.
New Hybrid Fine-Tuning Paradigm for LLMs: Algorithm Design and Convergence Analysis Framework
Shaocong Ma ⋅ Peiran Yu ⋅ Heng Huang
Fine-tuning Large Language Models (LLMs) typically involves either full fine-tuning, which updates all model parameters, or Parameter-Efficient Fine-Tuning (PEFT), which adjusts a small subset of parameters. However, both approaches have inherent limitations: full fine-tuning is computationally expensive, while PEFT often struggles to learn new knowledge and exhibits suboptimal performance. To overcome these issues, we propose a novel hybrid fine-tuning approach that jointly updates both LLMs and PEFT modules using a combination of zeroth-order and first-order optimization methods. To analyze our new algorithm, we develop a theoretical framework centered on a hybrid smoothness condition, which accounts for the heterogeneous nature of the optimization landscape in joint LLM and PEFT training. We provide a rigorous convergence analysis of a reshuffling-type SGD algorithm under multiple learning rates and demonstrate its effectiveness through extensive empirical studies across various downstream tasks and model architectures. On the practical side, our results demonstrate consistent performance improvement, making the approach a viable solution for large-scale language model fine-tuning.
Healthcare Insurance Fraud Detection via Continual Fiedler Vector Graph Model
Yehan Zhang ⋅ Huaidong Zhang ⋅ Xuandi Luo ⋅ Shengfeng He
Healthcare insurance fraud detection presents unique machine learning challenges: labeled data are scarce due to delayed verification processes, and fraudulent behaviors evolve rapidly, often manifesting in complex, graph-structured interactions. Existing methods struggle in such settings. Pretraining routines typically overlook structural anomalies under limited supervision, while online models often fail to adapt to changing fraud patterns without labeled updates. To address these issues, we propose the Continual Fiedler Vector Graph model (ConFVG), a fraud detection framework designed for label-scarce and non-stationary environments. The framework comprises two key components. To mitigate label scarcity, we develop a Fiedler vector-guided graph autoencoder that leverages spectral graph properties to learn structure-aware node representations. The Fiedler vector, the eigenvector associated with the second-smallest eigenvalue of the graph Laplacian, captures global topological signals such as community boundaries and connectivity bottlenecks, which are patterns frequently associated with collusive fraud. This enables the model to identify structurally anomalous nodes without relying on labels. To handle evolving graph streams, we propose a Subgraph Attention Fusion (SAF) module that constructs neighborhood subgraphs and applies attention-based reweighting to emphasize emerging high-risk structures. This design allows the model to adapt to new fraud patterns in real time. A Mean Teacher mechanism further stabilizes online updates and prevents forgetting of previously acquired knowledge. Experiments on real-world medical fraud datasets demonstrate that ConFVG outperforms state-of-the-art baselines in both low-label and distribution-shift scenarios, offering a scalable and structure-sensitive solution for real-time fraud detection.
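To make the spectral quantity in this abstract concrete, the following minimal sketch computes the Fiedler vector of a small undirected graph with NumPy. The toy graph (two triangles joined by a single edge) and the helper name `fiedler_vector` are assumptions chosen for illustration; they are not taken from the ConFVG implementation, which the abstract does not describe at this level of detail.

```python
import numpy as np

def fiedler_vector(adjacency):
    """Return (lambda_2, v_2) of the unnormalized Laplacian L = D - A."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvalues[1], eigenvectors[:, 1]

# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

lam2, v2 = fiedler_vector(A)
partition = v2 > 0  # the sign pattern splits the graph at the bridge edge
```

In this toy case the sign of each Fiedler vector entry separates the two triangles, which is the kind of community boundary and connectivity bottleneck the abstract refers to.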
Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation
Fei Wu ⋅ Jia Hu ⋅ Geyong Min ⋅ Shiqiang Wang
Driven by the rapid growth of model parameters, parameter-efficient fine-tuning (PEFT) has become essential for adapting large models to diverse downstream tasks under constrained computational resources. Within this paradigm, orthogonal fine-tuning and its variants preserve semantic representations of pre-trained models, but struggle to achieve both expressiveness and efficiency in terms of parameter counts, memory, and computation. To overcome this limitation, we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the principal subspace of pre-trained weights. Specifically, PSOFT constructs this subspace via matrix decomposition to enable compatible transformations with higher rank, establishes a theoretical condition that strictly maintains the geometry of this subspace for essential semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training to enhance adaptability. Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT.
Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
Filippo Rinaldi ⋅ Aniello Panariello ⋅ Giacomo Salici ⋅ Fengyuan Liu ⋅ Marco Ciccone ⋅ Angelo Porrello ⋅ Simone Calderara
When a new release of a foundation model is published, practitioners typically need to repeat fine-tuning, even if the same task was already tackled in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, these vectors often fail to transfer across different pre-trained models because their parameter spaces are misaligned. In this work, we show that successful transfer depends strongly on the gradient-sign structure of the new model. Based on this insight, we propose GradFix, which approximates the ideal sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: we only compute a few target-model gradients without parameter updates and mask the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning. We further show that transporting task vectors improves multi-task and multi-source model merging. Code is available at https://github.com/fillo-rinaldi/GradFix.
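The masking idea can be sketched in a few lines. The sketch below is one plausible reading of the abstract, not the GradFix code: it assumes the mask keeps only those task vector entries whose sign agrees with the descent direction (the negative target-model gradient). The function name `mask_task_vector` and the random vectors are hypothetical stand-ins.

```python
import numpy as np

def mask_task_vector(task_vector, target_grad):
    """Zero out entries that do not point along the descent direction -target_grad."""
    # An entry agrees with descent when task_vector and gradient have opposite signs
    agrees = task_vector * target_grad < 0
    return np.where(agrees, task_vector, 0.0)

rng = np.random.default_rng(0)
tau = rng.normal(size=1000)    # hypothetical source task vector
grad = rng.normal(size=1000)   # hypothetical target-model gradient (no parameter update)

masked = mask_task_vector(tau, grad)
# First-order change in the target loss when adding the masked vector
first_order_change = float(grad @ masked)
```

Under this assumption every surviving entry contributes a non-positive term to the inner product with the gradient, so the first-order change in the target loss cannot be positive, which is consistent with the first-order descent guarantee the abstract mentions.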
Expert Merging: Model Merging with Unsupervised Expert Alignment and Importance-Guided Layer Chunking
Dengming Zhang ⋅ Xiaowen Ma ⋅ Zhenliang Ni ⋅ Zhenkai Wu ⋅ Han Shu ⋅ Xin Jiang ⋅ Xinghao Chen
Model merging, which combines multiple domain-specialized experts into a single model, offers a practical path to endow Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) with broad capabilities without the cost of joint training or serving many models. However, training-free methods rely on hand-tuned coefficients, whereas training-based methods primarily align parameters rather than downstream task behavior and typically treat all layers uniformly, ignoring inter-layer heterogeneity. We introduce Expert Merging, a training-light method that learns a small set of layer-wise coefficients using only unlabeled calibration data. The coefficients are optimized to explicitly align the merged model’s hidden states and logits with those of the corresponding experts, with a coefficient regularizer for stability and task-weighted losses for controllable trade-offs. To capture inter-layer variation, Expert Merging++ augments this design with importance-guided chunking: a normalized layer-importance metric, derived from learned coefficients, task-vector magnitudes, and parameter counts, allocates more chunk-wise coefficients to high-importance layers while keeping low-importance layers lightweight. The result is a label-free, parameter-efficient, and scalable approach to multi-expert model merging across LLMs and MLLMs. Across MLLM backbones (InternVL and Qwen2-VL) and the LLM backbone (Mistral), our method surpasses strong training-free and training-based merging baselines, with Expert Merging++ delivering further gains and, in some cases, even exceeding supervised Mixture Training. Our code is available at https://github.com/Littleor/ExpertMerging.
Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature
Angelo Porrello ⋅ Pietro Buzzega ⋅ Felix Dangel ⋅ Thomas Sommariva ⋅ Riccardo Salami ⋅ Lorenzo Bonicelli ⋅ Simone Calderara
Task Arithmetic yields a modular, scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors; however, existing approaches typically require external task data, conflicting with modularity and data availability constraints (e.g., privacy requirements). We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature and obtain a practical regularizer that achieves state-of-the-art results in task addition and negation. Our method has constant complexity in the number of tasks and promotes robustness to task vector rescaling, eliminating the need for held-out tuning.
In-Context Learning of Temporal Point Processes with Foundation Inference Models
David Berghaus ⋅ Patrick Seifner ⋅ Kostadin Cvejoski ⋅ César Ali Ojeda Marin ⋅ Ramses Sanchez
Modeling multi-type event sequences with marked temporal point processes (MTPPs) provides a principled framework for uncovering governing dynamical rules and predicting future events. Current neural approaches to MTPP inference typically require training separate, specialized models for each target system. We pursue a fundamentally different strategy: leveraging amortized inference and in-context learning, we pretrain a deep neural network to infer, in-context, the conditional intensity functions of event histories from a context consisting of sets of event sequences. Pretraining is performed on a large synthetic dataset of MTPPs sampled from a broad distribution over point processes. Once pretrained, our Foundation Inference Model for Point Processes (FIM-PP) can estimate MTPPs from real-world data without additional training, or be rapidly finetuned to specific target systems. Experiments show that FIM-PP matches the performance of specialized models on multi-event prediction across common benchmark datasets.
Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
Minsu Kim ⋅ Jean-Pierre Falet ⋅ Oliver Richardson ⋅ Xiaoyin Chen ⋅ Moksh Jain ⋅ Sungjin Ahn ⋅ Sungsoo Ahn ⋅ Yoshua Bengio
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
Geometric Autoencoder Priors for Bayesian Inversion: Learn First Observe Later
Arnaud Vadeboncoeur ⋅ Gregory Duthé ⋅ Mark Girolami ⋅ Eleni Chatzi
Uncertainty Quantification (UQ) is paramount for inference in engineering. A common inference task is to recover full-field information of physical systems from a small number of noisy observations, a usually highly ill-posed problem. Sharing information from multiple distinct yet related physical systems can alleviate this ill-posedness. Critically, engineering systems often have complicated variable geometries prohibiting the use of standard multi-system Bayesian UQ. In this work, we introduce Geometric Autoencoders for Bayesian Inversion (GABI), a framework for learning geometry-aware generative models of physical responses that serve as highly informative geometry-conditioned priors for Bayesian inversion. Following a "learn first, observe later" paradigm, GABI distills information from large datasets of systems with varying geometries, without requiring knowledge of governing PDEs, boundary conditions, or observation processes, into a rich latent prior. At inference time, this prior is seamlessly combined with the likelihood of a specific observation process, yielding a geometry-adapted posterior distribution. Our proposed framework is architecture-agnostic. A creative use of Approximate Bayesian Computation (ABC) sampling yields an efficient implementation that utilizes modern GPU hardware. We test our method on: steady-state heat over rectangular domains; Reynolds-Averaged Navier-Stokes (RANS) flow around airfoils; Helmholtz resonance and source localization on 3D car bodies; RANS airflow over terrain. We find: the predictive accuracy to be comparable to deterministic supervised learning approaches in the restricted setting where supervised learning is applicable; UQ to be well calibrated and robust on challenging problems with complex geometries.
Alternating Diffusion for Proximal Sampling with Zeroth Order Queries
Hirohane Takagi ⋅ Atsushi Nitanda
This work introduces a new approximate proximal sampler that operates solely with zeroth-order information of the potential function. Prior theoretical analyses have revealed that proximal sampling corresponds to alternating forward and backward iterations of the heat flow. The backward step was originally implemented by rejection sampling, whereas we directly simulate the dynamics. Unlike diffusion-based sampling methods that estimate scores via learned models or by invoking auxiliary samplers, our method treats the intermediate particle distribution as a Gaussian mixture, thereby yielding a Monte Carlo score estimator from directly samplable distributions. Theoretically, when the score estimation error is sufficiently controlled, our method inherits the exponential convergence of proximal sampling under isoperimetric conditions on the target distribution. In practice, the algorithm avoids rejection sampling, permits flexible step sizes, and runs with a deterministic runtime budget. Numerical experiments demonstrate that our approach converges rapidly to the target distribution, driven by interactions among multiple particles and by exploiting parallel computation.
PriorGuide: Test-Time Prior Adaptation for Simulation-Based Inference
Yang Yang ⋅ Severi Rissanen ⋅ Paul Chang ⋅ Nasrulloh Satrio ⋅ Daolang Huang ⋅ Arno Solin ⋅ Markus Heinonen ⋅ Luigi Acerbi
Amortized simulator-based inference offers a powerful framework for tackling Bayesian inference in computational fields such as engineering or neuroscience, increasingly leveraging modern generative methods like diffusion models to map observed data to model parameters or future predictions. These approaches yield posterior or posterior-predictive samples for new datasets without requiring further simulator calls after training on simulated parameter-data pairs. However, their applicability is often limited by the prior distribution(s) used to generate model parameters during this training phase. To overcome this constraint, we introduce PriorGuide, a technique specifically designed for diffusion-based amortized inference methods. PriorGuide leverages a novel guidance approximation that enables flexible adaptation of the trained diffusion model to new priors at test time, crucially without costly retraining. This allows users to readily incorporate updated information or expert knowledge post-training, enhancing the versatility of pre-trained inference models.
Uncertainty-Aware Diagnostics for Physics-Informed Machine Learning
Mara Daniels ⋅ Liam Hodgkinson ⋅ Michael W Mahoney
Physics-informed machine learning (PIML) integrates prior physical information, often in the form of differential equation constraints, into the process of fitting ML models to physical data. Popular PIML approaches, including neural operators, physics-informed neural networks, and neural ordinary differential equations, are typically fit to objectives that simultaneously include both data and physical constraints. However, the multi-objective nature of this approach creates ambiguity in the measurement of model quality. This is related to a poor understanding of epistemic uncertainty, and it can lead to surprising failure modes, even when existing metrics suggest strong fits. Working within a Gaussian process regression framework, we introduce the Physics-Informed Log Evidence (PILE) score. Bypassing the ambiguities of test losses, the PILE score is a single, uncertainty-aware metric that provides a selection principle for hyperparameters of a physics-informed model. We show that PILE minimization yields excellent choices for a wide variety of model parameters, including kernel bandwidth, least squares regularization weights, and even kernel function selection. We also show that, prior to data acquisition, a special data-free case of the PILE score identifies a-priori kernel choices that are "well adapted" to a given PDE. Beyond the kernel setting, we anticipate that the PILE score can be extended to PIML at large, and we outline approaches to do so.
VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
Yu Feng ⋅ Nathaniel Weir ⋅ Kaj Bostrom ⋅ Sam Bayless ⋅ Darion Cassel ⋅ Sapana Chaudhary ⋅ Benjamin Kiesl-Reiter ⋅ Huzefa Rangwala
LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench-SARA, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT’s verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.
Resurfacing the Instance-only Dependent Label Noise Model through Loss Correction
Mustafa Enes Aydın ⋅ Maarten De Vos ⋅ Alexander Bertrand
We investigate the label noise problem in supervised binary classification settings and resurface the underutilized instance-only dependent noise model through loss correction. On the one hand, based on risk equivalence, the instance-aware loss correction scheme completes the bridge from empirical noisy risk minimization to true clean risk minimization, provided the base loss is classification-calibrated (e.g., cross-entropy). On the other hand, the instance-only dependent modeling of the label noise at the core of the correction enables us to estimate a single value per instance instead of a matrix. Furthermore, the estimation of the transition rates becomes a very flexible process, for which we offer several computationally efficient approaches. Empirical findings over different dataset domains (image, audio, tabular) with different learners (neural networks, gradient-boosted machines) validate the promised generalization ability of the method.
Multi-Condition Conformal Selection
Qingyang Hao ⋅ Wenbo Liao ⋅ Bingyi Jing ⋅ Hongxin Wei
Selecting high-quality candidates from large-scale datasets is critically important in resource-constrained applications such as drug discovery, precision medicine, and the alignment of large language models. While conformal selection methods offer a rigorous solution with False Discovery Rate (FDR) control, their applicability is confined to single-threshold scenarios (i.e., y > c) and overlooks practical needs for multi-condition selection, such as conjunctive or disjunctive conditions. In this work, we propose the Multi-Condition Conformal Selection (MCCS) algorithm, which extends conformal selection to scenarios with multiple conditions. In particular, we introduce a novel nonconformity score with regional monotonicity for conjunctive conditions and a global Benjamini–Hochberg (BH) procedure for disjunctive conditions, thereby establishing finite-sample FDR control with theoretical guarantees. The integration of these components enables the proposed method to achieve rigorous FDR-controlled selection in various multi-condition environments. Extensive experiments validate the superiority of MCCS over baselines, its generalizability across diverse condition combinations, different real-world modalities, and multi-task scalability.
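For readers unfamiliar with the Benjamini–Hochberg step-up rule that the abstract builds on, here is a minimal sketch of the classical procedure applied to a list of p-values. It is not the MCCS algorithm: the conformal p-value construction, the nonconformity score, and the "global" variant are not specified in the abstract, and the p-values below are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Classical BH step-up rule; returns a boolean mask of selected hypotheses."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value is compared against alpha * k / m
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank that meets its threshold
        selected[order[: k + 1]] = True    # select everything up to that rank
    return selected

# Hypothetical p-values for eight candidates
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
sel = benjamini_hochberg(p_vals, alpha=0.05)
```

With these inputs and `alpha=0.05`, only the two smallest p-values fall below their rank-dependent thresholds, so only the first two candidates are selected.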
OpenApps: Simulating Environment Variations to Measure UI Agent Reliability
Karen Ullrich ⋅ Jingtong Su ⋅ Claudia Shi ⋅ Arjun Subramonian ⋅ Amir Bar ⋅ Ivan Evtimov ⋅ Nikolaos Tsilivis ⋅ Randall Balestriero ⋅ Julia Kempe ⋅ Mark Ibrahim
Reliability is key to realizing the promise of autonomous UI agents, multimodal agents that directly interact with the apps humans use, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed, however, agents are likely to encounter variations in app design and content that can affect an agent’s ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a lightweight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations. Task success rates for many agents can fluctuate by more than 50% across app variations. For example, Kimi-VL-3B's average success across all tasks fluctuates from 63% to just 4% across app versions. We also find that agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations.
Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we augment them by synthesizing diverse alternative action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieves a 141% relative improvement and even surpasses Claude 3.7 Sonnet by 10% in relative terms on WindowsAgentArena-V2, an improved benchmark that we also release. By integrating robust human computer use skills with automated AI data synthesis capabilities, our method not only brings substantial improvements over training on human trajectories alone, but also significantly surpasses direct distillation from Claude 3.7 Sonnet.
Boosted Trees on a Diet: Compact Models for Resource-Constrained Devices
Nina Herrmann ⋅ Jan Stenkamp ⋅ Benjamin Karic ⋅ Stefan Oehmcke ⋅ Fabian Gieseke
Deploying machine learning models on compute-constrained devices has become a key building block of modern IoT applications. In this work, we present a compression scheme for boosted decision trees, addressing the growing need for lightweight machine learning models. Specifically, we provide techniques for training compact boosted decision tree ensembles that exhibit a reduced memory footprint by rewarding, among other things, the reuse of features and thresholds during training. Our experimental evaluation shows that, using an adapted training process and an alternative memory layout, our models achieve the same performance as LightGBM models at a compression ratio of 4–16x. Once deployed, the corresponding IoT devices can operate autonomously, independent of constant communication or external energy supply, requiring only minimal computing power and energy. This capability opens the door to a wide range of IoT applications, including remote monitoring, edge analytics, and real-time decision making in isolated or power-limited environments.
Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs
Hongyi Liu ⋅ Jiaji Huang ⋅ Zhen Jia ⋅ Youngsuk Park ⋅ Yu-Xiang Wang
Speculative decoding is widely used to accelerate large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the best draft model in hindsight for each query in terms of either the token acceptance probability or the expected acceptance length. In particular, we show that we can accurately evaluate all draft models, rather than only the chosen one, without incurring additional queries to the target model, which allows us to improve exponentially over the existing bandit-based approach as the number of draft models increases. Our approach is generically applicable to any speculative decoding method (single draft, multi-draft, and draft trees). Moreover, we design system-efficient versions of online learners and demonstrate that the overhead in computation and latency can be substantially reduced. We conduct extensive experiments on open-source LLMs and diverse datasets, demonstrating that our methods substantially outperform the state-of-the-art EAGLE3 and the BanditSpec baseline in a variety of domains where specialized domain-expert drafters are available, especially when long reasoning chains are required.
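The contrast with bandit feedback can be illustrated with a generic full-information online learner. The sketch below runs the standard Hedge (exponential weights) algorithm, which updates every arm each round because every arm's loss is observed. It is an assumption that a learner of this family is representative here; the abstract does not name the authors' algorithm, and the simulated acceptance probabilities and loss definition are hypothetical.

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Run Hedge on a (T, N) array of losses in [0, 1]; return the (T, N) weights played."""
    T, N = loss_matrix.shape
    log_w = np.zeros(N)
    played = np.empty((T, N))
    for t in range(T):
        w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
        played[t] = w / w.sum()
        log_w -= eta * loss_matrix[t]     # full-information update: every arm is scored
    return played

rng = np.random.default_rng(0)
T, N = 2000, 8
accept_prob = np.linspace(0.3, 0.8, N)    # hypothetical per-drafter acceptance rates
losses = (rng.random((T, N)) > accept_prob).astype(float)  # loss 1 = token rejected

eta = np.sqrt(8 * np.log(N) / T)
probs = hedge(losses, eta)
expected_loss = float((probs * losses).sum())
best_fixed_loss = float(losses.sum(axis=0).min())
regret = expected_loss - best_fixed_loss
```

With this step size, the textbook analysis of Hedge bounds the regret by sqrt(T ln(N) / 2), so the dependence on the number of arms N is logarithmic. Bandit algorithms, which observe only the chosen arm, typically incur a polynomial dependence on N.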
Revisiting Active Sequential Prediction-Powered Mean Estimation
Maria-Eleni Sfyraki ⋅ Jun-Kun Wang
In this work, we revisit the problem of active sequential prediction-powered mean estimation, where at each round one must decide the query probability of the ground-truth label upon observing the covariates of a sample. If the label is not queried, the prediction from a machine learning model is used instead. Prior work proposed an elegant scheme that determines the query probability by combining an uncertainty-based suggestion with a constant probability that encodes a soft constraint on the query probability. We explored different values of the mixing parameter and observed an intriguing empirical pattern: the smallest confidence width tends to occur when the weight on the constant probability is close to one, thereby reducing the influence of the uncertainty-based component. Motivated by this observation, we develop a non-asymptotic analysis of the estimator and establish a data-dependent bound on its confidence interval. Our analysis further suggests that when a no-regret learning approach is used to determine the query probability and control this bound, the query probability converges to the upper bound imposed by the constraint on the query probability when it is chosen obliviously to the current covariates. We also conduct simulations that corroborate these theoretical findings.
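The mixing scheme and the resulting estimator can be sketched as follows. This is a minimal illustration under stated assumptions: the query probability is taken to be a convex combination of an uncertainty score and a constant, and the mean is estimated with the usual inverse-probability-weighted correction of the model predictions. The function names, the synthetic data, and the uncertainty score are all hypothetical; the abstract does not give the exact formulas.

```python
import numpy as np

def query_probability(uncertainty, constant, weight):
    """Convex mix of an uncertainty-based suggestion and a constant probability."""
    return (1.0 - weight) * uncertainty + weight * constant

def active_mean_estimate(y, preds, probs, rng):
    """Predictions plus an inverse-probability-weighted correction on queried labels."""
    queried = rng.random(y.shape) < probs            # Bernoulli(probs) query decisions
    correction = np.where(queried, (y - preds) / probs, 0.0)
    return float(np.mean(preds + correction))

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)          # hypothetical ground-truth labels
preds = 2.0 * x                                      # hypothetical model predictions
uncertainty = np.clip(np.abs(x) / 3.0, 0.05, 1.0)    # hypothetical uncertainty score

# Weight close to one: the constant probability dominates the mix
probs = query_probability(uncertainty, constant=0.3, weight=0.9)
estimate = active_mean_estimate(y, preds, probs, rng)

# Sanity check: querying every label recovers the plain sample mean
always_query = active_mean_estimate(y, preds, np.ones(n), rng)
```

Setting `weight` near one makes the query probability nearly independent of the covariates, which is the regime where the abstract reports the smallest confidence width.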
Angle K-Means
Shenfei Pei ⋅ Ruiyu Huang ⋅ Yiqing Hu ⋅ Zhongqi Lin ⋅ Jiang Xudong ⋅ Zengwei Zheng
We propose an accelerated exact $k$-means algorithm, Angle $k$-means. As its name suggests, the algorithm mainly leverages angular relationships between data points and cluster centers to reduce computational overhead. Although grounded in straightforward geometric principles, it delivers substantial performance improvements in empirical evaluations. In contrast to existing acceleration techniques, our model introduces no new hyperparameters, preserving full compatibility with standard $k$-means. Theoretical analysis shows that Angle $k$-means maintains linear time complexity with respect to both sample size and dimensionality, while empirical evaluations on diverse real-world datasets demonstrate significant speedup over state-of-the-art algorithms such as ball $k$-means and Exp-ns.
Accessible, Realistic, and Fair Evaluation of Positive-Unlabeled Learning Algorithms
Wei Wang ⋅ Dong-Dong Wu ⋅ Ming Li ⋅ Jingxiong Zhang ⋅ Gang Niu ⋅ Masashi Sugiyama
Positive-unlabeled (PU) learning is a weakly supervised binary classification problem, in which the goal is to learn a binary classifier from only positive and unlabeled data, without access to negative data. In recent years, many PU learning algorithms have been developed to improve model performance. However, experimental settings are highly inconsistent, making it difficult to identify which algorithm performs better. In this paper, we propose the first PU learning benchmark to systematically compare PU learning algorithms. During our implementation, we identify subtle yet critical factors that affect the realistic and fair evaluation of PU learning algorithms. On the one hand, many PU learning algorithms rely on a validation set that includes negative data for model selection. This is unrealistic in traditional PU learning settings, where no negative data are available. To handle this problem, we systematically investigate model selection criteria for PU learning. On the other hand, PU learning involves different problem settings and corresponding solution families, i.e., the one-sample and two-sample settings. However, existing evaluation protocols are heavily biased towards the one-sample setting and neglect the significant difference between them. We identify the internal label shift problem of unlabeled training data for the one-sample setting and propose a simple yet effective calibration approach to ensure fair comparisons within and across families. We hope our framework will provide an accessible, realistic, and fair environment for evaluating PU learning algorithms in the future.
Boosting for Predictive Sufficiency
Gowtham Reddy Abbavaram ⋅ Rajeev Verma ⋅ Celia Rubio-Madrigal ⋅ Krikamol Muandet ⋅ Rebekka Burkholz
Out-of-distribution (OOD) generalization is a defining hallmark of truly robust and reliable machine learning systems. Recently, it has been empirically observed that existing OOD generalization methods often underperform on real-world tabular data, where hidden confounding shifts drive distribution changes that boosting models handle more effectively. Part of boosting’s success is attributed to variance reduction, handling missing variables, feature selection, and connections to multicalibration. This paper uncovers a crucial reason behind its success in OOD generalization: boosting’s ability to infer stable environments robust to hidden confounding shifts and maximize predictive performance within those environments. This paper introduces an information-theoretic notion called $\alpha$-predictive sufficiency and formalizes its link to OOD generalization under hidden confounding. We show that boosting implicitly identifies suitable environments and produces an $\alpha$-predictive sufficient predictor. We validate our theoretical results through synthetic and real-world experiments and show that boosting achieves robust performance by identifying these environments and maximizing the association between predictions and true outcomes.
Adaptive Conformal Guidance for Learning under Uncertainty
Rui Liu ⋅ Peng Gao ⋅ Yu Shen ⋅ Ming Lin ⋅ Pratap Tokekar
Learning with guidance has proven effective across a wide range of machine learning systems. Guidance may, for example, come from annotated datasets in supervised learning, pseudo-labels in semi-supervised learning, and expert demonstration policies in reinforcement learning. However, guidance signals can be noisy due to domain shifts and limited data availability and may not generalize well. Blindly trusting such signals when they are noisy, incomplete, or misaligned with the target domain can lead to degraded performance. To address these challenges, we propose Adaptive Conformal Guidance (AdaConG), a simple yet effective approach that dynamically modulates the influence of guidance signals based on their associated uncertainty, quantified via split conformal prediction (CP). By adaptively adjusting to guidance uncertainty, AdaConG enables models to reduce reliance on potentially misleading signals and enhance learning performance. We validate AdaConG across diverse tasks, including knowledge distillation, semi-supervised image classification, gridworld navigation, and autonomous driving. Experimental results demonstrate that AdaConG improves performance and robustness under imperfect guidance. For example, in gridworld navigation, it accelerates convergence and achieves over $6\times$ higher rewards than the best-performing baseline. These results highlight AdaConG as a broadly applicable solution for learning under uncertainty.
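For readers unfamiliar with split conformal prediction, the following is a minimal sketch of its calibration step only. It is the textbook quantile rule, not AdaConG itself; how AdaConG turns the resulting uncertainty into a guidance weight is not specified in the abstract, and the function name is hypothetical.

```python
import math

def split_conformal_quantile(cal_scores, alpha):
    # Textbook split-CP threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    # nonconformity score on a held-out calibration set.
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this coverage level.
        return math.inf
    return sorted(cal_scores)[k - 1]
```

A larger threshold signals wider prediction sets, i.e. a less certain guidance source, which is the kind of quantity an uncertainty-aware weighting could act on.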
MedAraBench: Large-scale Arabic Medical Question Answering Dataset and Benchmark
Mouath Abu-Daoud ⋅ Leen Kharouf ⋅ Omar El Hajj ⋅ Dana Samad ⋅ Mariam Al-Omari ⋅ Jihad Mallat ⋅ Khaled Saleh ⋅ Nizar Habash ⋅ Farah Shamout
Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of sixteen state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We also explore QLoRA fine-tuning on LLaMa-3.1-8B-instruct to assess our dataset's viability. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.
Bandits with Single-Peaked Preferences and Limited Resources
Omer Ben-Porat ⋅ Gur Keinan ⋅ Rotem Torkan
We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences---a well-established structure in social choice theory, where users' preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.
Slicing the Gaussian Mixture Wasserstein Distance
Moritz Piening ⋅ Robert Beinert
Gaussian mixture models (GMMs) are widely used in machine learning for tasks such as clustering, classification, image reconstruction, and generative modeling. A key challenge in working with GMMs is defining a computationally efficient and geometrically meaningful metric. The mixture Wasserstein (MW) distance adapts the Wasserstein metric to GMMs and has been applied in various domains, including domain adaptation, dataset comparison, and reinforcement learning. However, its high computational cost, arising from repeated Wasserstein distance computations involving matrix square root estimations and an expensive linear program, limits its scalability to high-dimensional and large-scale problems. To address this, we propose multiple novel slicing-based approximations to the MW distance that significantly reduce computational complexity while preserving key optimal transport properties. From a theoretical viewpoint, we establish several weak and strong equivalences between the introduced metrics, and show the relations to the original MW distance and the well-established sliced Wasserstein distance. Furthermore, we validate the effectiveness of our approach through numerical experiments, demonstrating computational efficiency and applications in clustering, perceptual image comparison, and GMM minimization.
BoGrape: Bayesian optimization over graphs with shortest-path encoded
Yilin Xie ⋅ Shiqiang Zhang ⋅ Jixiang Qing ⋅ Ruth Misener ⋅ Calvin Tsay
Graph-structured data are central to many scientific and industrial applications where the goal is to optimize expensive black-box objectives defined over graph structures or node configurations---as seen in molecular design, supply chains, and sensor placement. Bayesian optimization offers a principled approach for such settings, but existing methods largely focus on functions defined over nodes of a fixed graph. Moreover, graph optimization is often approached heuristically, and it remains unclear how to systematically incorporate structural constraints into BO. To address these gaps, we build on shortest-path graph kernels to develop a principled framework for acquisition optimization over unseen graph structures and associated node attributes. Through a novel formulation based on mixed-integer programming, we enable global exploration of the combinatorial domain over graph structures and explicit embedding of problem-specific constraints. We demonstrate that our method, BoGrape, is competitive both on general synthetic benchmarks and representative molecular design case studies with application-specific constraints.
Neural Predictor-Corrector: Solving Homotopy Problems with Reinforcement Learning
Jiayao Mai ⋅ Bangyan Liao ⋅ Zhenjun Zhao ⋅ Yingping Zeng ⋅ Haoang Li ⋅ Javier Civera ⋅ Tailin Wu ⋅ Yi Zhou ⋅ Peidong Liu
The Homotopy paradigm, a general principle for solving challenging problems, appears across diverse domains such as robust optimization, global optimization, polynomial root-finding, and sampling. Practical solvers for these problems typically follow a predictor-corrector (PC) structure, but rely on hand-crafted heuristics for step sizes and iteration termination, which are often suboptimal and task-specific. To address this, we unify these problems under a single framework, which enables the design of a general neural solver. Building on this unified view, we propose Neural Predictor-Corrector (NPC), which replaces hand-crafted heuristics with automatically learned policies. NPC formulates policy selection as a sequential decision-making problem and leverages reinforcement learning to automatically discover efficient strategies. To further enhance generalization, we introduce an amortized training mechanism, enabling one-time offline training for a class of problems and efficient online inference on new instances. Experiments on four representative homotopy problems demonstrate that our method generalizes effectively to unseen instances. It consistently outperforms classical and specialized baselines in efficiency while demonstrating superior stability across tasks, highlighting the value of unifying homotopy methods into a single neural framework.
RRNCO: Towards Real-World Routing with Neural Combinatorial Optimization
Jiwoo Son ⋅ Zhikai Zhao ⋅ Federico Berto ⋅ Chuanbo Hua ⋅ Zhiguang Cao ⋅ Changhyun Kwon ⋅ Jinkyoo Park
The practical deployment of Neural Combinatorial Optimization (NCO) for Vehicle Routing Problems (VRPs) is hindered by a critical sim-to-real gap. This gap stems not only from training on oversimplified Euclidean data but also from node-based architectures incapable of handling the node-and-edge-based features with correlated asymmetric cost matrices, such as those for real-world distance and duration. We introduce RRNCO, a novel architecture specifically designed to address these complexities. RRNCO's novelty lies in two key innovations. First, its Adaptive Node Embedding (ANE) efficiently fuses spatial coordinates with real-world distance features using a learned contextual gating mechanism. Second, its Neural Adaptive Bias (NAB) is the first mechanism to jointly model asymmetric distance, duration, and directional angles, enabling it to capture complex, realistic routing constraints. Moreover, we introduce a new VRP benchmark grounded in real-world data crucial for bridging this sim-to-real gap, featuring asymmetric distance and duration matrices from 100 diverse cities, enabling the training and validation of NCO solvers on tasks that are more representative of practical settings. Experiments demonstrate that RRNCO achieves state-of-the-art performance on this benchmark, significantly advancing the practical applicability of neural solvers for real-world logistics. Our code, dataset, and pretrained models are available at https://github.com/ai4co/real-routing-nco.
Human-LLM Collaborative Feature Engineering for Tabular Data
Zhuoyan Li ⋅ Aditya Bansal ⋅ Jinzhao Li ⋅ Shishuang He ⋅ Zhuoran Lu ⋅ Mutian Zhang ⋅ Qin Liu ⋅ Yiwei Yang ⋅ Swati Jain ⋅ Ming Yin ⋅ Yunyao Li
Large language models (LLMs) are increasingly used to automate feature engineering in tabular learning. Given task-specific information, LLMs can propose diverse feature transformation operations to enhance downstream model performance. However, current approaches typically treat the LLM as a black-box optimizer, responsible for both proposing and selecting operations based solely on its internal heuristics. These heuristics often lack calibrated estimates of operation utility and consequently lead to repeated exploration of low-yield operations without a principled strategy for prioritizing promising directions. In this paper, we propose a human–LLM collaborative feature engineering framework for tabular learning. We begin by decoupling the transformation operation proposal and selection processes: LLMs are used solely to generate operation candidates, while the selection is guided by explicitly modeling the utility and uncertainty of each proposed operation. Since accurate utility estimation can be difficult, especially in the early rounds of feature engineering, we design a mechanism within the framework that selectively elicits human expert preference feedback (comparisons of which operations are more promising) and incorporates it into the selection process to help identify more effective operations. Our evaluations in both a synthetic study and a real user study demonstrate that the proposed framework improves feature engineering performance across a variety of tabular datasets and reduces users’ cognitive load during the feature engineering process.
Towards Better Branching Policies: Leveraging the Sequential Nature of Branch-and-Bound Tree
Ce Zhang ⋅ Bin Zhang ⋅ Guoliang Fan
The branch-and-bound (B&B) method is a dominant exact algorithm for solving Mixed-Integer Linear Programming problems (MILPs). While recent deep learning approaches have shown promise in learning branching policies using instance-independent features, they often struggle to capture the sequential decision-making nature of B&B, particularly over long horizons with complex inter-step dependencies and intra-step variable interactions. To address these challenges, we propose Mamba-Branching, a novel learning-based branching policy that leverages the Mamba architecture for efficient long-sequence modeling, enabling effective capture of temporal dynamics across B&B steps. Additionally, we introduce a contrastive learning strategy to pre-train discriminative embeddings for candidate branching variables, significantly enhancing Mamba's performance. Experimental results demonstrate that Mamba-Branching outperforms all previous neural branching policies on real-world MILP instances and achieves superior computational efficiency compared to the advanced open-source solver SCIP. The source code can be accessed at https://github.com/doctor-watson626/Mamba-Branching/.
Celo2: Towards Learned Optimization Free Lunch
Abhinav Moudgil ⋅ Boris Knyazev ⋅ Eugene Belilovsky
Learned optimizers are powerful alternatives to hand-designed update rules like Adam, yet they have seen limited practical adoption since they often fail to meta-generalize beyond their training distribution and incur high meta-training cost. For instance, prior work, VeLO, scaled meta-training to 4,000 TPU months ($\sim$10$\times$ GPT-3 compute) to meta-train a general-purpose optimizer, but it failed to generalize beyond 600M-parameter tasks. In this work, we present a surprising finding: by crafting a simple normalized optimizer architecture and augmenting meta-training, it becomes feasible to meta-train a performant general-purpose learned update rule on a tiny fraction of VeLO compute, 4.5 GPU hours to be precise. Our learned update rule scales stably to a billion-scale pretraining task (GPT-3 XL 1.3B), which is six orders of magnitude larger than its meta-training distribution. Furthermore, it shows strong performance across diverse out-of-distribution tasks and is compatible with a modern optimization harness that includes orthogonalization, distinct update rules for input-output and hidden weights, and decoupled weight decay. In all, this work paves the way for practically applicable _learnable_ optimization algorithms, unlocking exploration of richer meta-training and data curation recipes to further improve performance.
AutoEP: LLMs-Driven Automation of Hyperparameter Evolution for Metaheuristic Algorithms
Zhenxing Xu ⋅ Yizhe Zhang ⋅ Weidong Bao ⋅ Hao Wang ⋅ Ming Chen ⋅ Haoran Ye ⋅ Wenzheng Jiang ⋅ Hui Yan ⋅ Ji Wang
Dynamically configuring algorithm hyperparameters is a fundamental challenge in computational intelligence. While learning-based methods offer automation, they suffer from prohibitive sample complexity and poor generalization. We introduce AutoEP, a novel framework that bypasses training entirely by leveraging Large Language Models (LLMs) as zero-shot reasoning engines for algorithm control. AutoEP's core innovation lies in a tight synergy between two components: (1) an online Exploratory Landscape Analysis (ELA) module that provides real-time, quantitative feedback on the search dynamics, and (2) a multi-LLM reasoning chain that interprets this feedback to generate adaptive hyperparameter strategies. This approach grounds high-level reasoning in empirical data, mitigating hallucination. Evaluated on three distinct metaheuristics across diverse combinatorial optimization benchmarks, AutoEP consistently outperforms state-of-the-art tuners, including neural evolution and other LLM-based methods. Notably, our framework enables open-source models like Qwen3-30B to match the performance of GPT-4, demonstrating a powerful and accessible new paradigm for automated hyperparameter design. Our code is available at https://anonymous.4open.science/r/AutoEP-3E11.
Learning to Adapt: In-Context Learning Beyond Stationarity
Zhen Qin ⋅ Jiachen Jiang ⋅ Zhihui Zhu
Transformer models have become foundational across a wide range of scientific and engineering domains due to their strong empirical performance. A key capability underlying their success is in-context learning (ICL): when presented with a short prompt from an unseen task, transformers can perform per-token and next-token predictions without any parameter updates. Recent theoretical efforts have begun to uncover the mechanisms behind this phenomenon, particularly in supervised regression settings. However, these analyses predominantly assume stationary task distributions, which overlook a broad class of real-world scenarios where the target function varies over time. In this work, we bridge this gap by providing a theoretical analysis of ICL under non-stationary regression problems. We study how the gated linear attention (GLA) mechanism adapts to evolving input-output relationships and rigorously characterize its advantages over standard linear attention in this dynamic setting. To model non-stationarity, we adopt a first-order autoregressive process and show that GLA achieves lower training and testing errors by adaptively modulating the influence of past inputs--effectively implementing a learnable recency bias. Our theoretical findings are further supported by empirical results, which validate the benefits of gating mechanisms in non-stationary ICL tasks.
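The recency bias mentioned in this abstract can be illustrated with a minimal sketch of a gated linear attention recurrence. It assumes one scalar gate per time step multiplying a key-value memory matrix; the paper's exact parameterization may differ, and the function name is hypothetical.

```python
import numpy as np

def gated_linear_attention(q, k, v, gates):
    # q, k: (T, d_k); v: (T, d_v); gates: (T,) with values in [0, 1].
    # State update (assumed form): S_t = g_t * S_{t-1} + k_t v_t^T
    # Output:                      o_t = q_t^T S_t
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        S = gates[t] * S + np.outer(k[t], v[t])  # decay old memory, add new pair
        out[t] = q[t] @ S
    return out
```

With all gates equal to one this reduces to standard (causal) linear attention; gates below one discount older key-value pairs geometrically, which is what lets the model track a drifting target.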
Improving Feasibility via Fast Autoencoder-Based Projections
Maria Chzhen ⋅ Priya Donti
Enforcing complex (e.g., nonconvex) operational constraints is a critical challenge in real-world learning and control systems. However, existing methods struggle to efficiently enforce general classes of constraints. To address this, we propose a novel data-driven amortized approach that uses a trained autoencoder as an approximate projector to provide fast corrections to infeasible predictions. Specifically, we train an autoencoder using an adversarial objective to learn a structured, convex latent representation of the feasible set. This enables rapid correction of neural network outputs by projecting their associated latent representations onto a simple convex shape before decoding into the original feasible set. We test our approach on a diverse suite of constrained optimization and reinforcement learning problems with challenging nonconvex constraints. Results show that our method effectively enforces constraints at a low computational cost, offering a practical alternative to expensive feasibility correction techniques based on traditional solvers.
Neural Graduated Assignment for Maximum Common Edge Subgraphs
Chaolong Ying ⋅ Yingqi Ruan ⋅ Xuemin Chen ⋅ Yaomin Wang ⋅ Tianshu Yu
The Maximum Common Edge Subgraph (MCES) problem is a crucial challenge with significant implications in domains such as biology and chemistry. Traditional approaches, which include transformations into max-clique and search-based algorithms, suffer from scalability issues when dealing with larger instances. This paper introduces "Neural Graduated Assignment" (NGA), a simple, scalable, unsupervised-training-based method that addresses these limitations. Central to NGA is the stacking of differentiable assignment optimization with neural components, enabling high-dimensional parameterization of the matching process through a learnable temperature mechanism. We further theoretically analyze the learning dynamics of NGA, showing that its design leads to fast convergence, a better exploration-exploitation tradeoff, and the ability to escape local optima. Extensive experiments across MCES computation, graph similarity estimation, and graph retrieval tasks reveal that NGA not only significantly improves computation time and scalability on large instances but also enhances performance compared to existing methodologies. The introduction of NGA marks a significant advancement in the computation of MCES and offers insights into other assignment problems.
HBO: Hierarchical Balancing Optimization for Fine-Tuning Large Language Models
Weixuan Wang ⋅ Minghao Wu ⋅ Barry Haddow ⋅ Alexandra Birch
Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state, which measure learning progress and relative performance improvement. We evaluate HBO on three LLM backbones across nine diverse tasks in multilingual and multitask setups. Results show that HBO consistently outperforms existing baselines, achieving significant accuracy gains. Our in-depth analysis further demonstrates that both the global actor and local actors of HBO effectively adjust data usage during fine-tuning. HBO provides a comprehensive solution to the challenges of data imbalance and heterogeneity in LLM fine-tuning, enabling more effective training across diverse datasets.
Pinet: Optimizing hard-constrained neural networks with orthogonal projection layers
Panagiotis D. Grontas ⋅ Antonio Terpin ⋅ Efe Balta ⋅ Raffaello D'Andrea ⋅ John Lygeros
We introduce an output layer for neural networks that ensures satisfaction of convex constraints. Our approach, $\Pi$net, leverages operator splitting for rapid and reliable projections in the forward pass, and the implicit function theorem for backpropagation. We deploy $\Pi$net as a feasible-by-design optimization proxy for parametric constrained optimization problems and obtain modest-accuracy solutions faster than traditional solvers when solving a single problem, and significantly faster for a batch of problems. We surpass state-of-the-art learning approaches by orders of magnitude in terms of training time, solution quality, and robustness to hyperparameter tuning, while maintaining similar inference times. Finally, we tackle multi-vehicle motion planning with non-convex trajectory preferences and provide $\Pi$net as a GPU-ready package implemented in JAX.
We analyze Muon as originally proposed, using the momentum orthogonalization with a few Newton-Schulz steps. The prior theoretical results replace this key step in Muon with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor for a given number of Newton-Schulz steps $q$. We further analyze this constant factor, and prove that it converges to 1 doubly exponentially in $q$ and improves with the degree of the polynomial used in Newton-Schulz for approximating the orthogonalization direction. We also prove that Muon improves the rank dependence compared to its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior at much faster wall-clock time, and explain how much momentum matrix orthogonalization via Newton-Schulz benefits over the vector-based optimizer. Overall, our theory justifies the practical Newton-Schulz design of Muon, narrowing its practice–theory gap.
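The relationship between Newton-Schulz steps and the exact polar factor can be illustrated with a minimal sketch. It uses the classical cubic Newton-Schulz iteration as an assumption; Muon implementations typically use a higher-degree polynomial with tuned coefficients, and the function name here is hypothetical.

```python
import numpy as np

def newton_schulz_polar(M, steps=10):
    # Scale by the Frobenius norm so all singular values lie in (0, 1],
    # inside the convergence region of the cubic iteration.
    X = M / np.linalg.norm(M)
    for _ in range(steps):
        # Cubic Newton-Schulz step: pushes every singular value toward 1
        # while leaving the singular vectors unchanged.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

For a full-rank matrix the iterates approach $UV^\top$ from the SVD $M = U \Sigma V^\top$ using only matrix multiplications, which is why a handful of steps can stand in for an exact SVD.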
ARFlow: Auto-regressive Optical Flow Estimation for Arbitrary-Length Videos via Progressive Next-Frame Forecasting
Jiuming Liu ⋅ Mengmeng Liu ⋅ Siting Zhu ⋅ Yunpeng Zhang ⋅ Jiangtao Li ⋅ Michael Yang ⋅ Francesco Nex ⋅ Hao Cheng ⋅ Hesheng Wang
Optical flow estimation is a fundamental computer vision task that predicts per-pixel displacements from consecutive images. Recent works attempt to exploit temporal cues to improve the estimation performance. However, their temporal modeling is restricted to short video sequences due to the unaffordable computational burden, thereby suffering from restricted temporal receptive fields. Moreover, their group-wise paradigm in one forward pass undermines inter-group information exchange, leading to modest performance improvement. To address these problems, we propose a novel multi-frame optical flow network based on an auto-regressive paradigm, named ARFlow. Unlike previous multi-frame methods, our method scales to arbitrary-length videos with marginal computational overhead. Specifically, we design an Auto-regressive Flow Initialization (AFI) module and an Auto-regressive Multi-stride Flow Refinement (AMFR) module to forecast the next-frame flow based on multi-stride history observations. Our ARFlow achieves state-of-the-art performance, ranking 1st on both KITTI-2015 and Spring official benchmarks and 2nd on the MPI-Sintel (Final) benchmark among all open-sourced methods. Furthermore, due to its auto-regressive nature, our method can generalize to arbitrary video length with a constant GPU memory usage of 2.1GB.
Faster Gradient Methods for Highly-smooth Stochastic Bilevel Optimization
Lesi Chen ⋅ Junru Li ⋅ EL MAHDI CHAYTI ⋅ Jingzhao Zhang
This paper studies the complexity of finding an $\epsilon$-stationary point for stochastic bilevel optimization when the upper-level problem is nonconvex and the lower-level problem is strongly convex. Recent work proposed a first-order method, F${}^2$SA, achieving the $\tilde{\mathcal{O}}(\epsilon^{-6})$ upper complexity bound for first-order smooth problems. This is slower than the optimal $\Omega(\epsilon^{-4})$ complexity lower bound in its single-level counterpart. In this work, we show that faster rates are achievable for higher-order smooth problems. We first reformulate F${}^2$SA as approximating the hyper-gradient with a forward difference. Based on this observation, we propose a class of methods, F${}^2$SA-$p$, that uses a $p$th-order finite difference for hyper-gradient approximation and improves the upper bound to $\tilde{\mathcal{O}}(p \epsilon^{-4-2/p})$ for $p$th-order smooth problems. Finally, we demonstrate that the $\Omega(\epsilon^{-4})$ lower bound also holds for stochastic bilevel problems when the high-order smoothness holds for the lower-level variable, indicating that the upper bound of F${}^2$SA-$p$ is nearly optimal in the highly smooth region $p = \Omega( \log \epsilon^{-1} / \log \log \epsilon^{-1})$.
Cautious Weight Decay
Lizhang Chen ⋅ Jonathan Li ⋅ Kaizhao Liang ⋅ Baiyu Su ⋅ Cong Xie ⋅ Chen Liang ⋅ Ni Lao ⋅ Qiang Liu
We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
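The sign-alignment rule described in this abstract can be sketched in a few lines. The sketch assumes that `update` is the quantity the base optimizer subtracts from the parameters and that "aligned" means a strictly positive elementwise product; both conventions, and the function name, are assumptions rather than the authors' reference implementation.

```python
import numpy as np

def cautious_weight_decay_step(params, update, lr, weight_decay):
    # Decay only the coordinates where the parameter and the optimizer update
    # share a sign, i.e. where decay and update both move in the same direction.
    params = np.asarray(params, dtype=float)
    update = np.asarray(update, dtype=float)
    mask = (params * update) > 0
    return params - lr * (update + weight_decay * mask * params)
```

Replacing `mask` with an all-ones array recovers standard decoupled weight decay, so the change really is a single masked term on top of an existing optimizer step.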
Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization
Haocheng Luo ⋅ Zehang Deng ⋅ Thanh-Toan Do ⋅ Mehrtash Harandi ⋅ Dinh Phung ⋅ Trung Le
Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate-wise dynamics in logit space. Our analysis reveals that negative-gradient updates cause residuals to expand rapidly along high-curvature directions, which underlies the squeezing effect, whereas Sharpness-Aware Minimization (SAM) can suppress this behavior through its curvature-regularization effect. Building on this insight, we investigate logits-SAM, a computationally efficient variant that perturbs only the output layer with negligible overhead. Extensive experiments on Pythia-2.8B, Mistral-7B, and Gemma-2B-IT across multiple datasets and benchmarks demonstrate that logits-SAM consistently improves the effectiveness of DPO and integrates seamlessly with other DPO variants. Code is available at .
Diffusion-DFL: Decision-focused Diffusion Models for Stochastic Optimization
Zihao Zhao ⋅ Christopher Yeh ⋅ Lingkai Kong ⋅ Kai Wang
Decision-focused learning (DFL) integrates predictive modeling and optimization by training predictors to optimize the downstream decision target rather than merely minimizing prediction error. Existing DFL methods typically rely on deterministic point predictions, which are often insufficient to capture the intrinsic stochasticity of real-world environments. To address this challenge, we propose the first diffusion-based DFL approach, which trains a diffusion model to represent the distribution of uncertain parameters and optimizes the decision by solving a stochastic optimization problem with samples drawn from the diffusion model. Our contributions are twofold. First, we formulate diffusion DFL using the reparameterization trick, enabling end-to-end training through diffusion. While effective, this formulation is memory- and compute-intensive because it must differentiate through the diffusion sampling process. Second, we propose a lightweight score function estimator that uses only several forward diffusion passes and avoids backpropagation through the sampling. This follows from our result that backpropagating through stochastic optimization can be approximated by a weighted score function formulation. We empirically show that our diffusion DFL approach consistently outperforms strong baselines in decision quality. The source code for all experiments is available at https://github.com/GT-KOALA/Diffusion_DFL.
Conformal Robustness Control: A New Strategy for Robust Decision
Yang Hu ⋅ Jieren Tan ⋅ Changliang Zou ⋅ Yajie Bao ⋅ Haojie Ren
Robust decision-making is crucial in numerous risk-sensitive applications where outcomes are uncertain and the cost of failure is high. Conditional Robust Optimization (CRO) offers a framework for such tasks by constructing prediction sets for the outcome that satisfy predefined coverage requirements and then making decisions based on these sets. Many existing approaches leverage conformal prediction to build prediction sets with guaranteed coverage for CRO. However, since coverage is a sufficient but not necessary condition for robustness, enforcing such constraints often leads to overly conservative decisions. To overcome this limitation, we propose a novel framework, Conformal Robustness Control (CRC), that directly optimizes the prediction set construction under explicit robustness constraints, thereby enabling more efficient decisions without compromising robustness. We develop efficient algorithms to solve the CRC optimization problem and provide theoretical guarantees on both robustness and optimality. Empirical results show that CRC consistently yields more effective decisions than existing baselines while still meeting the target robustness level.
A Sharp KL Convergence Analysis for Diffusion Models under Minimal Assumptions
Nishant Jain ⋅ Tong Zhang
Diffusion-based generative models have emerged as highly effective methods for synthesizing high-quality samples. Recent works have focused on analyzing the convergence of their generation process with minimal assumptions, either through reverse SDEs or probability flow ODEs. The best known guarantees for the KL divergence without any smoothness assumptions so far achieve a linear dependence on the data dimension $d$ and an inverse quadratic dependence on the accuracy level $\varepsilon$. In this work, we present a refined analysis for the standard Exponential Integrator discretization that improves the dependence on $\varepsilon$ while maintaining the linear dependence on $d$. Following recent works on higher order/randomized midpoint discretizations, we model the generation process as a composition of two steps: a reverse ODE step followed by a smaller noising step, which leads to better dependence on step size. We then provide a novel analysis which achieves linear dependence on $d$ for the ODE discretization error without any smoothness assumptions. Specifically, we introduce a general ODE-based counterpart of the stochastic localization argument from Benton et al. and develop new proof techniques to bound second-order spatial derivatives of the score function, terms that do not arise in previous diffusion analyses and cannot be handled by existing techniques. Leveraging this framework, we prove that $\tilde{O}\left(\tfrac{d \log^{3/2}(1/\delta)}{\varepsilon}\right)$ steps suffice to approximate the target distribution (corrupted by Gaussian noise of variance $\delta$) to within $O(\varepsilon^2)$ in KL divergence, improving upon the previous best result requiring $\tilde{O}\left(\tfrac{d \log^2(1/\delta)}{\varepsilon^2}\right)$ steps.
Strictly Constrained Generative Modeling via Split Augmented Langevin Sampling
Matthieu Blanke ⋅ Yongquan Qu ⋅ Sara Shamekh ⋅ Pierre Gentine
Deep generative models hold great promise for representing complex physical systems, but their deployment is currently limited by the lack of guarantees on the physical plausibility of the generated outputs. Ensuring that known physical constraints are enforced is therefore critical when applying generative models to scientific and engineering problems. We address this limitation by developing a principled framework for sampling from a target distribution while rigorously satisfying physical constraints. Leveraging the variational formulation of Langevin dynamics, we propose Split Augmented Langevin (CASAL), a novel primal-dual sampling algorithm that enforces constraints progressively through variable splitting, with convergence guarantees. While the method is developed theoretically for Langevin dynamics, we demonstrate its effective applicability to diffusion models. We apply our method to diffusion-based data assimilation on a complex physical system, where enforcing physical constraints substantially improves both forecast accuracy and the preservation of critical conserved quantities. We also demonstrate the potential of CASAL for challenging feasibility problems in optimal control.
Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
Arpan Mukherjee ⋅ Marcello Bullo ⋅ Debabrota Basu ⋅ Deniz Gunduz
While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remains underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator’s coverage, (ii) the verifier’s region of convergence (ROC), and (iii) the sampling algorithm’s sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality-coverage curve exhibits three regimes: a transport regime, where sub-optimality increases with coverage; a policy improvement regime, where sub-optimality may decrease with coverage, depending on the verifier’s ROC; and a saturation regime, where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms, sequential and batched, and examine how their computational complexities shape these trade-offs. Empirical results with Qwen, Llama, and Gemma models corroborate our theoretical findings.
Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces
Bryon Tjanaka ⋅ Henry Chen ⋅ Matthew Fontaine ⋅ Stefanos Nikolaidis
Quality diversity (QD) optimization searches for a collection of solutions that optimize an objective while attaining diverse outputs of a user-specified, vector-valued measure function. Contemporary QD algorithms are typically limited to low-dimensional measures because high-dimensional measures are prone to distortion, where many solutions found by the QD algorithm map to similar measures. For example, the state-of-the-art CMA-MAE algorithm guides measure space exploration with a histogram in measure space that records so-called discount values. However, CMA-MAE stagnates in domains with high-dimensional measure spaces because solutions with similar measures fall into the same histogram cell and hence receive the same discount value. To address these limitations, we propose Discount Model Search (DMS), which guides exploration with a model that provides a smooth, continuous representation of discount values. In high-dimensional measure spaces, this model enables DMS to distinguish between solutions with similar measures and thus continue exploration. We show that DMS facilitates new capabilities for QD algorithms by introducing two new domains where the measure space is the high-dimensional space of images, which enables users to specify their desired measures by providing a dataset of images rather than hand-designing the measure function. Results in these domains and on high-dimensional benchmarks show that DMS outperforms CMA-MAE and other existing black-box QD algorithms.
Converge Faster, Talk Less: Hessian-Informed Federated Zeroth-Order Optimization
Zhe Li ⋅ Bicheng Ying ⋅ Zidong Liu ⋅ Chaosheng Dong ⋅ Haibo Yang
Zeroth-order (ZO) optimization enables dimension-free communication in federated learning (FL), making it attractive for fine-tuning of large language models (LLMs) due to significant communication savings. However, existing ZO-FL methods largely overlook curvature information, despite its well-established benefits for convergence acceleration. To address this, we propose HiSo, a Hessian-informed ZO federated optimization method that accelerates convergence by leveraging global diagonal Hessian approximations, while strictly preserving scalar-only communication without transmitting any second-order information. Theoretically, for non-convex functions, we show that HiSo can achieve an accelerated convergence rate that is independent of the Lipschitz constant $L$ and model dimension $d$ under some Hessian approximation assumptions, offering a plausible explanation for the observed phenomenon of ZO convergence being much faster than its worst-case $O(d)$-bound. Empirically, across diverse LLM fine-tuning benchmarks, HiSo delivers a 1$\sim$5$\times$ speedup in communication rounds over existing state-of-the-art ZO-FL baselines. This superior convergence not only cuts communication costs but also provides strong empirical evidence that Hessian information acts as an effective accelerator in federated ZO optimization settings.
Scaling Multi-Task Bayesian Optimization with Large Language Models
Yimeng Zeng ⋅ Natalie Maus ⋅ Haydn Jones ⋅ Jeffrey Tao ⋅ Fangping Wan ⋅ Marcelo Der Torossian Torres ⋅ Cesar de la Fuente ⋅ Ryan Marcus ⋅ Osbert Bastani ⋅ Jacob Gardner
In multi-task Bayesian optimization, the goal is to leverage experience from optimizing existing tasks to improve the efficiency of optimizing new ones. While approaches using multi-task Gaussian processes or deep kernel transfer exist, the performance improvement is marginal when scaling beyond a moderate number of tasks. We introduce BOLT, an initialization-only transfer strategy that distills prior BO runs into an LLM which proposes candidates for new tasks, while the surrogate at test time remains single-task. The LLM is periodically fine-tuned on top solutions from completed runs, creating a closed loop where better BO outputs yield better initializations over time. This decoupled design scales to roughly 1500 tasks without the saturation observed for shared-surrogate MTBO and adds only a small, amortized overhead relative to the BO inner loops. We evaluate on two domains: database query optimization and antimicrobial peptide design. We demonstrate that LLM-generated initializations steadily improve and accelerate BO, and with sufficient fine-tuning, a few LLM samples often match or surpass full "from-scratch" BO with far fewer oracle calls.
FZOO: Fast Zeroth-Order Optimizer for Fine‑Tuning Large Language Models towards Adam‑Scale Speed
Sizhe Dang ⋅ yangyangGuo ⋅ Yanjun Zhao ⋅ Xiaodong Zheng ⋅ Guang Dai ⋅ Ivor Tsang ⋅ Haishan Ye
Fine-tuning large language models (LLMs) often faces GPU memory bottlenecks: the backward pass of first-order optimizers like Adam increases memory usage to more than 10 times the inference level (e.g., 633 GB for OPT-30B). Zeroth-order (ZO) optimizers avoid this cost by estimating gradients only from forward passes, yet existing methods like MeZO usually need tens of times more steps to converge. Can this trade-off between speed and memory in ZO be fundamentally improved? Normalized-SGD, for instance, demonstrates strong empirical performance with greater memory efficiency than Adam. In light of this, we introduce FZOO, a Fast Zeroth-Order Optimizer towards Adam-Scale Speed. On the one hand, FZOO reduces the total forward passes needed for convergence by employing batched one-sided estimates that adapt step sizes based on the standard deviation of batch losses. On the other hand, it accelerates per-batch computation through the use of Rademacher random vector (±1) perturbations, which also enables further speedups through batched evaluation. Extensive experiments on diverse models (including RoBERTa-large, the OPT family (350M-66B), Phi-2, and Llama3) across 11 varied downstream tasks validate FZOO's effectiveness. On average, FZOO outperforms MeZO by +3% in accuracy while requiring 3$\times$ fewer forward passes. Notably, for the RoBERTa-large model, FZOO achieves average improvements of +5.6% in accuracy and an 18$\times$ reduction in forward passes compared to MeZO, achieving convergence speeds comparable to Adam. We also provide theoretical analysis proving FZOO’s formal equivalence to a normalized-SGD update rule and establishing its convergence guarantees. Beyond full-parameter tuning, FZOO plugs smoothly into PEFT techniques, unlocking even larger memory savings. Taken together, our results make single-GPU, high-speed, full-parameter fine-tuning realistic today and point toward future work on memory-efficient pre-training. Code: https://github.com/DKmiyan/FZOO
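The ingredients named in the FZOO abstract (Rademacher perturbations, one-sided loss differences, and a step size scaled by the standard deviation of the batch losses) can be combined in a toy NumPy sketch. This is a hedged illustration, not the released FZOO code: the function name `fzoo_style_step`, its parameters, and the exact way the estimate is normalized are assumptions.

```python
import numpy as np

def fzoo_style_step(loss_fn, theta, lr=0.1, eps=1e-3, n_perturb=8, rng=None):
    """One forward-only update on a 1-D parameter vector (illustrative sketch).

    Assumptions (not taken from the paper): the estimate averages one-sided
    differences over n_perturb directions and is divided by the standard
    deviation of the perturbed losses.
    """
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(theta)  # unperturbed loss, reused by every one-sided difference
    # Rademacher directions: each entry is +1 or -1 with equal probability.
    dirs = rng.integers(0, 2, size=(n_perturb, theta.size)) * 2.0 - 1.0
    losses = np.array([loss_fn(theta + eps * u) for u in dirs])
    # One-sided differences weight each direction; no backward pass is needed.
    direction = ((losses - base)[:, None] * dirs).mean(axis=0)
    # Dividing by the loss spread makes the step insensitive to the scale of eps.
    sigma = losses.std() + 1e-12
    return theta - lr * direction / sigma
```

On a simple quadratic such as `loss_fn = lambda t: 0.5 * float(t @ t)`, repeated calls move `theta` toward the origin using only forward evaluations of the loss.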
Task-free Adaptive Meta Black-box Optimization
Chao Wang ⋅ Licheng Jiao ⋅ Lingling Li ⋅ Jiaxuan Zhao ⋅ Guanchun Wang ⋅ Fang Liu ⋅ Shuyuan Yang
Handcrafted optimizers become prohibitively inefficient for complex black-box optimization (BBO) tasks. MetaBBO addresses this challenge by meta-learning to automatically configure optimizers for low-level BBO tasks, thereby eliminating heuristic dependencies. However, existing methods typically require extensive handcrafted training tasks to learn meta-strategies that generalize to target tasks, which poses a critical limitation for realistic applications with unknown task distributions. To overcome the issue, we propose the Adaptive meta Black-box Optimization Model (ABOM), which performs online parameter adaptation using solely optimization data from the target task, obviating the need for predefined task distributions. Unlike conventional metaBBO frameworks that decouple meta-training and optimization phases, ABOM introduces a closed-loop adaptive parameter learning mechanism, where parameterized evolutionary operators continuously self-update by leveraging generated populations during optimization. This paradigm shift enables zero-shot optimization: ABOM achieves competitive performance on synthetic BBO benchmarks and realistic unmanned aerial vehicle path planning problems without any handcrafted training tasks. Visualization studies reveal that parameterized evolutionary operators exhibit statistically significant search patterns, including natural selection and genetic recombination.
Riemannian Federated Learning via Averaging Gradient Streams
Zhenwei Huang ⋅ Wen Huang ⋅ Pratik Jawanpuria ⋅ Bamdev Mishra
Federated learning (FL), as a distributed learning paradigm, has a significant advantage in addressing large-scale machine learning tasks. In the Euclidean setting, FL algorithms have been extensively studied with both theoretical and empirical success. However, few works investigate federated learning algorithms in the Riemannian setting. In particular, critical challenges such as partial participation and data heterogeneity among agents have not been explored in the Riemannian federated setting. This paper presents and analyzes a Riemannian FL algorithm, called RFedAGS, based on a new efficient server aggregation, averaging gradient streams, which can simultaneously handle partial participation and data heterogeneity. We theoretically show that RFedAGS converges globally at a sublinear rate with decaying step sizes, and converges sublinearly/linearly to a neighborhood of a stationary point/solution with fixed step sizes. These analyses are based on a vital and non-trivial assumption induced by partial participation, which is shown to hold with high probability. Extensive experiments conducted on synthetic and real-world data demonstrate the good performance of RFedAGS.
Developmental Federated Tuning: A Cognitive-Inspired Paradigm for Efficient LLM Adaptation
Yebo Wu ⋅ Jingguang Li ⋅ Zhijiang Guo ⋅ Li Li
Federated fine-tuning enables Large Language Models (LLMs) to adapt to downstream tasks while preserving data privacy, but its resource-intensive nature severely limits deployment on edge devices. In this paper, we introduce Developmental Federated Tuning (DevFT), a resource-efficient approach inspired by cognitive development that progressively builds a powerful LLM from a compact foundation. DevFT decomposes the fine-tuning process into developmental stages, each optimizing a submodel with increasing parameter capacity. Knowledge acquired in earlier stages is transferred to subsequent submodels, providing optimized initialization parameters that prevent convergence to local minima and accelerate training. This paradigm mirrors human learning, gradually constructing a comprehensive knowledge structure while refining existing skills. To efficiently build stage-specific submodels, DevFT introduces deconfliction-guided layer grouping and differential-based layer fusion to distill essential information and construct representative layers. Evaluations across multiple benchmarks demonstrate that DevFT significantly outperforms state-of-the-art methods, achieving up to 4.59$\times$ faster convergence, a 10.67$\times$ reduction in communication overhead, and a 9.07% average performance improvement, while maintaining compatibility with existing federated fine-tuning approaches.
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Yan Li ⋅ Zhenyu Zhang ⋅ Zhengang Wang ⋅ Pengfei chen ⋅ Pengfei Zheng
Prevailing LLM (Large Language Model) serving engines employ expert parallelism (EP) to implement multi-device inference of massive Mixture-of-Experts (MoE) models. However, the efficiency of expert-parallel inference is largely bounded by inter-device communication, as EP relies on expensive all-to-all collectives to route tokens to remote experts when they are not collocated on the same GPU/NPU device. Nevertheless, state-of-the-art schemes treat expert device-placement and request (or token) device-scheduling as separate concerns, triggering excessive communication between them and compromising inference efficiency. This paper proposes Semantic Parallelism, a novel parallelism paradigm that minimizes the steep communication costs in EP-centric MoE serving via model-data collaborative scheduling. We implement Semantic Parallelism in a framework called Sem-MoE. Sem-MoE maximally collocates experts and their activating tokens onto the same device using proactively modeled activation likelihood between them and introduces three key techniques: (1) Offline model scheduling, which preliminarily clusters and collocates experts onto devices based on their co-activation tendencies for certain classes of input. (2) Online inter-request data scheduling for Attention-DP setups, which proactively rebatches incoming requests onto the device that hosts the experts most likely and most frequently activated by the corresponding requests. (3) Online intra-request data scheduling for Attention-TP setups, which seamlessly fuses a token reshuffling procedure into the original inference pipeline and proactively reschedules tokens to devices to reduce dispersed remote routing. We build Sem-MoE into SGLANG, a prevailing LLM serving engine. Experiments show that our collaborative scheduling approach can effectively reduce the all-to-all communication volume in EP and achieve superior inference throughput compared to existing solutions.
Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence
Shuhua Yu ⋅ Dusan Jakovetic ⋅ Soummya Kar
Heavy-tailed noise in nonconvex stochastic optimization has garnered increasing research interest, as empirical studies, including those on training attention models, suggest it is a more realistic gradient noise condition. This paper studies first-order nonconvex stochastic optimization under heavy-tailed gradient noise in a decentralized setup, where each node can only communicate with its direct neighbors in a predefined graph. Specifically, we consider a class of heavy-tailed gradient noise that is zero-mean and has only $p$-th moment for $p \in (1, 2]$. We propose GT-NSGDm, Gradient Tracking based Normalized Stochastic Gradient Descent with momentum, that utilizes normalization, in conjunction with gradient tracking and momentum, to cope with heavy-tailed noise on distributed nodes. We show that, when the communication graph admits primitive and doubly stochastic weights, GT-NSGDm guarantees, for the first time in the literature, that the expected gradient norm converges at an optimal non-asymptotic rate $O\big(1/T^{(p-1)/(3p-2)}\big)$, which matches the lower bound in the centralized setup. When the tail index $p$ is unknown, GT-NSGDm attains a non-asymptotic rate $O\big( 1/T^{(p-1)/(2p)} \big)$ that is, for $p < 2$, topology independent and has a speedup factor $n^{1-1/p}$ in terms of the number of nodes $n$. Finally, experiments on nonconvex linear regression with tokenized synthetic data and decentralized training of language models on a real-world corpus demonstrate that GT-NSGDm is more robust and efficient than baselines.
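How normalization, momentum, and gradient tracking can fit together in a decentralized update is sketched below in NumPy. This is a generic composition of those three ingredients under stated assumptions, not necessarily the update order or parameterization of GT-NSGDm: the function name `gt_nsgdm_style_step`, the array layout, and the placement of the mixing matrix `W` are all hypothetical.

```python
import numpy as np

def gt_nsgdm_style_step(X, M, Y, grads, W, lr=0.05, beta=0.9):
    """One synchronous round over all nodes (illustrative sketch only).

    X, M, Y, grads: arrays of shape (n_nodes, dim) holding each node's
    iterate, momentum, gradient tracker, and fresh stochastic gradient.
    W: (n_nodes, n_nodes) doubly stochastic mixing matrix of the graph.
    """
    # Local momentum smooths the (possibly heavy-tailed) stochastic gradients.
    M_new = beta * M + (1.0 - beta) * grads
    # Gradient tracking: mix the tracker plus the momentum increment with neighbors.
    Y_new = W @ (Y + M_new - M)
    # Normalization bounds the step length regardless of how large the noise is.
    norms = np.linalg.norm(Y_new, axis=1, keepdims=True)
    X_new = W @ (X - lr * Y_new / np.maximum(norms, 1e-12))
    return X_new, M_new, Y_new
```

If `M` and `Y` start equal and `W` is doubly stochastic, the node-average of the trackers equals the node-average of the momenta after every round, which is the invariant that gradient tracking relies on.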
Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity
Artavazd Maranjyan ⋅ Peter Richtarik
Asynchronous stochastic gradient methods are central to scalable distributed optimization, particularly when devices differ in computational capabilities. Such settings arise naturally in federated learning, where training takes place on smartphones and other heterogeneous edge devices. In addition to varying computation speeds, these devices often hold data from different distributions. However, existing asynchronous SGD methods struggle in such heterogeneous settings and face two key limitations. First, many rely on unrealistic assumptions of similarity across workers' data distributions. Second, methods that relax this assumption still fail to achieve theoretically optimal performance under heterogeneous computation times. We introduce Ringleader ASGD, the first asynchronous SGD algorithm that attains the theoretical lower bounds for parallel first-order stochastic methods in the smooth nonconvex regime, thereby achieving optimal time complexity under data heterogeneity and without restrictive similarity assumptions. Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary and even time-varying worker computation speeds, closing a fundamental gap in the theory of asynchronous optimization.
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
Ali Saheb Pasand ⋅ Elvis Dohmatob
Grokking is the phenomenon whereby, unlike the training performance, which peaks very early in training, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close-to-perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity, where this stagnation has been widely observed and intensively studied, our proposed method removes the plateaus.
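One way to make every singular direction of a gradient matrix move at the same speed is to replace all of its singular values with one. The NumPy sketch below illustrates that idea only; whether EGD uses exactly this normalization is an assumption, and the function name `egalitarian_direction` is hypothetical.

```python
import numpy as np

def egalitarian_direction(grad):
    """Equalize the speed along all singular directions (illustrative sketch).

    Assumption: 'same speed' is realized by setting every singular value of
    the 2-D gradient matrix to 1 while keeping its singular vectors.
    """
    # Thin SVD: grad = U @ diag(s) @ Vt, with orthonormal columns in U and rows in Vt.
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    # Dropping diag(s) keeps the directions and discards their unequal magnitudes.
    return U @ Vt
```

A plain gradient step would then become `weights - lr * egalitarian_direction(grad)`, so directions with small singular values no longer lag behind the dominant ones.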
OmniCVR: A Benchmark for Omni-Composed Video Retrieval with Vision, Audio, and Text
Junyang Ji ⋅ Shengjun Zhang ⋅ Da Li ⋅ Yuxiao Luo ⋅ Yan Wang ⋅ Di Xu ⋅ Biao Yang ⋅ Wei Yuan ⋅ Fan Yang ⋅ Zhihai He ⋅ Wenming Yang
Composed video retrieval presents a complex challenge: retrieving a target video based on a source video and a textual modification instruction. This task demands fine-grained reasoning over multimodal transformations. However, existing benchmarks predominantly focus on vision–text alignment, largely overlooking the rich semantic signals embedded in audio—such as speech, music, and environmental sounds—which are often decisive for comprehensive video understanding. To bridge this gap, we introduce OmniCVR, a large-scale benchmark for omni-composed video retrieval that establishes vision, audio, and text as first-class modalities. OmniCVR is constructed via a scalable, automated pipeline integrating content-aware segmentation, omni-modal annotation, and a rigorous dual-validation protocol involving both large language models and human experts. The benchmark comprises vision-centric, audio-centric, and integrated queries, with the latter forming the majority to accurately reflect real-world multimodal complexity. Furthermore, we propose AudioVLM2Vec, an audio-aware extension of VLM2Vec. By incorporating explicit audio semantics, AudioVLM2Vec achieves state-of-the-art performance, highlighting fundamental limitations in the audio reasoning capabilities of current multimodal retrieval systems.
Scaling Direct Feedback Learning with Jacobian Alignment Guarantees
Paul Caillon ⋅ Erwan Fagnou ⋅ Blaise Delattre ⋅ Alexandre Allauzen
Deep neural networks rely on backpropagation (BP) for optimization, but its strictly sequential backward pass hinders parallelism and scalability. Direct Feedback Alignment (DFA) has been proposed as a promising approach for parallel learning of deep neural networks, relying on fixed random projections to enable layer-wise parallel updates, but fails on deep convolutional networks, and performs poorly on modern transformer architectures. We introduce GrAPE (Gradient-Aligned Projected Error), a hybrid feedback-alignment method that (i) estimates rank-1 Jacobians via forward-mode JVPs and (ii) aligns each layer’s feedback matrix by minimizing a local cosine-alignment loss. To curb drift in very deep models, GrAPE performs infrequent BP anchor steps on a single mini-batch, preserving mostly parallel updates. We show that the forward-gradient estimator has strictly positive expected cosine with the true Jacobian. We relate this estimator-level guarantee to a standard stochastic-approximation result under a positive expected-cosine condition on the update direction, providing theoretical support for GrAPE’s alignment objective. Empirically, GrAPE consistently outperforms prior alternatives to BP, enabling the training of modern architectures, closing a large fraction of the gap to BP while retaining layer-parallel updates for the vast majority of steps.
COSMOS: A Hybrid Adaptive Optimizer for Efficient Training of Large Language Models
Liming Liu ⋅ Zhenghao Xu ⋅ Zixuan Zhang ⋅ Hao Kang ⋅ Zichong Li ⋅ Chen Liang ⋅ Weizhu Chen ⋅ Tuo Zhao
Large Language Models (LLMs) have demonstrated remarkable success across various domains, yet their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit. While adaptive optimizers such as AdamW are widely used, they suffer from critical limitations, including an inability to capture interdependencies between coordinates and high memory consumption. Subsequent research, exemplified by SOAP, attempts to better capture coordinate interdependence but incurs greater memory overhead, limiting scalability for massive LLMs. An alternative approach aims to reduce memory consumption through low-dimensional projection, but these methods lose the gradient information in the residual space, resulting in less effective optimization. In this paper, we propose COSMOS, a novel hybrid optimizer that leverages the varying importance of eigensubspaces in the gradient matrix to achieve memory efficiency without compromising optimization performance. The design of COSMOS is motivated by our empirical insights and practical considerations. Specifically, COSMOS applies SOAP to the leading eigensubspace, which captures the primary optimization dynamics, and MUON to the remaining eigensubspace, which is less critical but computationally expensive to handle with SOAP. This hybrid strategy significantly reduces memory consumption while maintaining robust optimization performance, making it particularly suitable for massive LLMs. Numerical experiments on various datasets and transformer architectures are provided to demonstrate the effectiveness of COSMOS.
Robust Training of Neural Networks at Arbitrary Precision and Sparsity
Chengxi Ye ⋅ Grace Chu ⋅ Yanfeng Liu ⋅ Yichi Zhang ⋅ Lukasz Lew ⋅ Li Zhang ⋅ Mark Sandler ⋅ Andrew Howard
The discontinuous operations inherent in quantization and sparsification introduce a long-standing obstacle to backpropagation, particularly in ultra-low precision and sparse regimes. While the community has long viewed quantization as unfriendly to gradient descent due to its lack of smoothness, we pinpoint—for the first time—that the key issue is the absence of a proper gradient path that allows training to learn robustness to quantization noise. The standard Straight-Through Estimator (STE) exacerbates this with its well-understood mismatch: a quantization-aware forward pass but oblivious backward pass, leading to unmanaged error and instability. We solve this by explicitly modeling quantization as additive noise, making the full forward-backward path well-defined without heuristic gradient estimation. As one natural solution, we introduce a denoising dequantization transform derived from a principled ridge regression objective, creating an explicit, corrective gradient path that makes learning robust to the noise STE bypasses. We extend this to sparsification by treating it as a special form of quantization that zeros out small values. Our unified framework trains models at arbitrary precisions and sparsity levels with off-the-shelf recipes, enabling stable A1W1 and sub-1-bit networks where others falter. It yields state-of-the-art results, mapping efficiency frontiers for modern LLMs and providing a theoretically grounded path to hyper-efficient neural networks.
Neural Multi-Objective Combinatorial Optimization for Flexible Job Shop Scheduling Problems
Igor Smit ⋅ Yaoxin Wu ⋅ Pavel Troubil ⋅ Yingqian Zhang ⋅ Wim P.M. Nuijten
Neural combinatorial optimization (NCO) has made significant advances in applying deep learning techniques to efficiently and effectively solve single-objective flexible job shop scheduling problems (FJSPs). However, the more practical multi-objective FJSPs (MOFJSPs) remain underexplored, limiting the applicability of NCO in multi-criteria decision-making scenarios. In this paper, we propose a decomposition-based NCO method to solve MOFJSPs. We present the dual conditional attention network (DCAN), a neural network architecture that takes the objective preferences along with the problem instance, aiming to learn adaptable policies over the preferences. By decomposing an MOFJSP into a set of subproblems with different preferences, the learned DCAN policies generate a set of solutions that reflect the corresponding trade-offs. We customize the Proximal Policy Optimization algorithm based on decomposition to effectively train the policy network for multiple objectives and define the state and reward based on combinations of different objectives. Extensive results showcase that our approach outperforms traditional multi-objective optimization methods and generalizes well across diverse types of problem instances.
T-TAMER: Provably Taming Trade-offs in ML Serving
Yuanyuan Yang ⋅ Ruimin Zhang ⋅ Jamie Morgenstern ⋅ Haifeng Xu
As machine learning models continue to grow in size and complexity, efficient serving faces increasingly broad trade-offs spanning accuracy, latency, resource usage, and other objectives. Multi-model serving further complicates these trade-offs; for example, in cascaded models, each early-exit decision balances latency reduction against potential accuracy loss. Despite the pervasiveness and importance of such trade-offs, current strategies remain largely heuristic and case-specific, limiting both their theoretical guarantees and general applicability. We present a general framework, T-Tamer, which formalizes this setting as a multi-stage decision process, where the objective is to determine both when to exit and which model to consult. Our main result shows that recall (i.e., the ability to revisit earlier models) is both necessary and sufficient for achieving provable performance guarantees. In particular, we prove that strategies without recall cannot obtain any constant-factor approximation to the optimal trade-off, whereas recall-based strategies provably attain the optimal trade-off in polynomial time. We validate our analysis through experiments on synthetic datasets and early-exit workloads for vision and NLP benchmarks. The results show that recall-based strategies consistently yield efficient accuracy–latency trade-offs. We hope this work provides a principled foundation for bridging heuristic practice with theoretical guarantees in the design of early-exit and cascaded models.
Redirection for Erasing Memory (REM): Towards a universal unlearning method for corrupted data
Stefan Schoepf ⋅ Michael Mozer ⋅ Nicole Mitchell ⋅ Alexandra Brintrup ⋅ Georgios Kaissis ⋅ Peter Kairouz ⋅ Eleni Triantafillou
Machine unlearning is studied for a multitude of tasks, but specialization of unlearning methods to particular tasks has made their systematic comparison challenging. To address this issue, we propose a conceptual space to characterize diverse corrupted data unlearning tasks in vision classifiers. This space is described by two dimensions, the discovery rate (the fraction of the corrupted data that are known at unlearning time) and the statistical regularity of the corrupted data (from random exemplars to shared concepts). Methods proposed previously have been targeted at portions of this space and, as we show, fail predictably outside these regions. We propose Redirection for Erasing Memory (REM), whose key feature is that corrupted data are redirected to dedicated neurons introduced at unlearning time and then discarded or deactivated to suppress the influence of corrupted data. REM performs strongly across the space of tasks, in contrast to prior SOTA methods that fail outside the regions for which they were designed.
Training Large Language Models To Reason In Parallel With Global Forking Tokens
Sheng Jia ⋅ Xiao Wang ⋅ Shiva Kasiviswanathan
Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using bipartite matching between global forking tokens and unique reasoning traces. We observe that whereas naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Global Forking Policy Optimization (GFPO) leverages these maximally steerable tokens to incentivize complex reasoning, and the resulting models consistently outperform their SFT counterparts with GRPO on both math reasoning and execution-based code generation benchmarks.
Test-Time Scaling with Reflective Generative Model
Zixiao Wang ⋅ Yuxin Wang ⋅ Xiaorui Wang ⋅ Mengting Xing ⋅ Jie Gao ⋅ Jianjun Xu ⋅ Guangcan Liu ⋅ Chenhui Jin ⋅ Zhuo Wang ⋅ Shengzhuo zhang ⋅ Hongtao Xie
We introduce a new Reflective Generative Model (RGM), which attains OpenAI o3-mini's performance via a novel Reflective Generative Form. This form focuses on high-quality reasoning trajectory selection and contains two novelties: 1) A unified interface for policy and process reward model: we share the backbone network and use task-specific heads for reasoning trajectory prediction and scoring respectively, introducing only 50M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model (SPRM), which can directly learn high-quality reasoning trajectory selection from the outcome reward. Equipped with the reflective generative form, RGM is naturally suitable for test-time scaling based on the controllable thinking length. Experiments show that our RGM, equipped with only 50M additional parameters in SPRM, outperforms policy models with 72B extra reward models, thereby enabling a 32B model to outperform OpenAI o3-mini on AIME24 (84.2 vs. 79.6) and HMMT25 (53.1 vs. 53.0). Code is available at https://github.com/MetaStone-AI/XBai-o4.
We introduce a highly expressive class of function approximators called Splat Regression Models. Model outputs are mixtures of heterogeneous and anisotropic bump functions, termed splats, each weighted by an output vector. The power of splat modeling lies in its ability to locally adjust the scale and direction of each splat, achieving both high interpretability and accuracy. Fitting splat models reduces to optimization over the space of mixing measures, which can be implemented using Wasserstein-Fisher-Rao gradient flows. As a byproduct, we recover the popular Gaussian Splatting methodology as a special case, providing a unified theoretical framework for this state-of-the-art technique that clearly disambiguates the inverse problem, the model, and the optimization algorithm. Through numerical experiments, we demonstrate that the resulting models and algorithms constitute a flexible and promising approach for solving diverse approximation, estimation, and inverse problems involving low-dimensional data.
Deep Latent Variable Model based Vertical Federated Learning with Flexible Alignment and Labeling Scenarios
KIHUN HONG ⋅ Sejun Park ⋅ Ganguk Hwang
Federated learning (FL) has attracted significant attention for enabling collaborative learning without exposing private data. Among the primary variants of FL, vertical federated learning (VFL) addresses feature-partitioned data held by multiple institutions, each holding complementary information for the same set of users. However, existing VFL methods often impose restrictive assumptions such as a small number of participating parties, fully aligned data, or only using labeled data. In this work, we reinterpret alignment gaps in VFL as missing data problems and propose a unified framework that accommodates both training and inference under arbitrary alignment and labeling scenarios, while supporting diverse missingness mechanisms. In the experiments on 168 configurations spanning four benchmark datasets, six training-time missingness patterns, and seven testing-time missingness patterns, our method outperforms all baselines in 160 cases with an average gap of 9.6 percentage points over the next-best competitors. To the best of our knowledge, this is the first VFL framework to jointly handle arbitrary data alignment, unlabeled data, and multi-party collaboration all at once.
LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Guanzheng Chen ⋅ Michael Qizhe Shieh ⋅ Lidong Bing
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding, i.e., the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model in identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR, which augments the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
FS-KAN: Permutation Equivariant Kolmogorov-Arnold Networks via Function Sharing
Ran Elbaz ⋅ Guy Bar-Shalom ⋅ Yam Eitan ⋅ Fabrizio Frasca ⋅ Haggai Maron
Permutation equivariant neural networks employing parameter-sharing schemes have emerged as powerful models for leveraging a wide range of data symmetries, significantly enhancing the generalization and computational efficiency of the resulting models. Recently, Kolmogorov-Arnold Networks (KANs) have demonstrated promise through their improved interpretability and expressivity compared to traditional architectures based on MLPs. While equivariant KANs have been explored in recent literature for a few specific data types, a principled framework for applying them to data with permutation symmetries in a general context remains absent. This paper introduces Function Sharing KAN (FS-KAN), a principled approach to constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups, unifying and significantly extending previous work in this domain. We derive the basic construction of these FS-KAN layers by generalizing parameter-sharing schemes to the Kolmogorov-Arnold setup and provide a theoretical analysis demonstrating that FS-KANs have the same expressive power as networks that use standard parameter-sharing layers, allowing us to transfer well-known and important expressivity results from parameter-sharing networks to FS-KANs. Empirical evaluations on multiple data types and symmetry groups show that FS-KANs exhibit superior data efficiency compared to standard parameter-sharing layers, by a wide margin in certain cases, while preserving the interpretability and adaptability of KANs, making them an excellent architecture choice in low-data regimes.
C-Voting: Confidence-Based Test-Time Voting without Explicit Energy Functions
Kenji Kubo ⋅ Shunsuke Kamiya ⋅ Masanori Koyama ⋅ Kohei Hayashi ⋅ Yusuke Iwasawa ⋅ Yutaka Matsuo
Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks. A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks. In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model’s confidence. Additionally, it yields $4.9\%$ higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions. An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function. Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme ($95.2\%$ vs. $55.0\%$) and Maze ($78.6\%$ vs. $74.5\%$) tasks.
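The selection rule described in this abstract (pick the candidate whose predictions have the highest average top-1 probability) can be sketched in a few lines. The function name `c_voting` and the nested-list input layout are assumptions for illustration; how the candidates are produced by the recurrent model is outside the scope of this sketch.

```python
def c_voting(candidate_probs):
    """Pick the candidate with the highest mean top-1 probability.

    candidate_probs[k][i] is the class-probability list for prediction i
    (e.g. one Sudoku cell) of candidate trajectory k. The layout is a
    hypothetical choice for this example.
    """
    scores = []
    for predictions in candidate_probs:
        top1 = [max(probs) for probs in predictions]
        scores.append(sum(top1) / len(top1))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


candidates = [
    [[0.6, 0.4], [0.5, 0.5]],  # unsure on both predictions
    [[0.9, 0.1], [0.2, 0.8]],  # confident on both predictions
    [[0.7, 0.3], [0.7, 0.3]],
]
best, scores = c_voting(candidates)
```

Only the output probabilities are needed, which is consistent with the abstract's point that no explicit energy function is required.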
Breaking the Correlation Plateau: On the Optimization and Capacity Limits of Attention-Based Regressors
Jingquan Yan ⋅ Yuwei Miao ⋅ Peiran Yu ⋅ Junzhou Huang
Attention-based regression models are often trained by jointly optimizing Mean Squared Error (MSE) loss and Pearson correlation coefficient (PCC) loss, emphasizing the magnitude of errors and the order or shape of targets, respectively. A common but poorly understood phenomenon during training is the PCC plateau: PCC stops improving early in training, even as MSE continues to decrease. We provide the first rigorous theoretical analysis of this behavior, revealing fundamental limitations in both optimization dynamics and model capacity. First, in regard to the flattened PCC curve, we uncover a critical conflict where lowering MSE (magnitude matching) can paradoxically suppress the PCC gradient (shape matching). This issue is exacerbated by the softmax attention mechanism, particularly when the data to be aggregated is highly homogeneous. Second, we identify a limitation in the model capacity: we derive a PCC improvement limit for any convex aggregator (including softmax attention), showing that the convex hull of the inputs strictly bounds the achievable PCC gain. We demonstrate that data homogeneity intensifies both limitations. Motivated by these insights, we propose Extrapolative Correlation Attention (ECA), which incorporates novel, theoretically motivated mechanisms to improve the PCC optimization and extrapolate beyond the convex hull. Across diverse benchmarks, including challenging homogeneous data settings, ECA consistently breaks the PCC plateau, achieving significant improvements in correlation without compromising MSE performance.
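To make the two objectives concrete, the sketch below computes MSE, PCC, and a joint loss in plain Python. The combination `mse + lam * (1 - pcc)` and the weight `lam` are assumptions for illustration; the abstract does not state how the two terms are weighted.

```python
import math


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(pred, target):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    vp = sum((p - mp) ** 2 for p in pred)
    vt = sum((t - mt) ** 2 for t in target)
    if vp == 0.0 or vt == 0.0:
        return 0.0
    return cov / math.sqrt(vp * vt)


def joint_loss(pred, target, lam=1.0):
    # Hypothetical weighting: magnitude term plus lam times the shape term.
    return mse(pred, target) + lam * (1.0 - pearson(pred, target))


target = [1.0, 2.0, 3.0]
shifted = [3.0, 5.0, 7.0]  # 2 * target + 1: perfect shape, wrong magnitude
```

The `shifted` predictions have a PCC of 1 but a nonzero MSE, which illustrates why the two terms measure different things: PCC is invariant to positive scaling and shifting, while MSE is not.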
Disentangling Length Bias in Preference Learning via Response-Conditioned Modeling
Jianfeng Cai ⋅ Jinhua Zhu ⋅ Ruopei Sun ⋅ Yue Wang ⋅ Li Li ⋅ Wengang Zhou ⋅ Houqiang Li
Reinforcement Learning from Human Feedback (RLHF) has achieved considerable success in aligning large language models (LLMs) by modeling human preferences with a learnable reward model and employing a reinforcement learning algorithm to maximize the reward model's scores. However, these reward models are susceptible to exploitation through various superficial confounding factors, with length bias emerging as a particularly significant concern. Moreover, while the pronounced impact of length bias on preference modeling suggests that LLMs possess an inherent sensitivity to length perception, our preliminary investigations reveal that fine-tuned LLMs consistently struggle to adhere to explicit length instructions. To address these two limitations, we propose a novel framework wherein the reward model explicitly differentiates between human semantic preferences and response length requirements. Specifically, we introduce a $\textbf{R}$esponse-$\textbf{c}$onditioned $\textbf{B}$radley-$\textbf{T}$erry (Rc-BT) model that, through training on our augmented dataset, improves both length bias mitigation and length instruction following. Furthermore, we propose the Rc-RM and Rc-DPO algorithms, which leverage the Rc-BT model for reward modeling and direct policy optimization (DPO) of LLMs, simultaneously mitigating length bias and promoting adherence to length instructions. Extensive experiments across various models and datasets demonstrate the effectiveness and generalizability of our approach.
Poly-attention: a general scheme for higher-order self-attention
Sayak Chakrabarti ⋅ Toniann Pitassi ⋅ Josh Alman
The self-attention mechanism, at the heart of the transformer model, is able to effectively model pairwise interactions between tokens. However, numerous recent works have shown that it is unable to perform basic tasks involving detecting triples of correlated tokens, or compositional tasks where multiple input tokens need to be referenced to generate a result. Some higher-dimensional alternatives to self-attention have been proposed to address this, including higher-order attention (Sanford et al., 2023) and Strassen attention (Kozachinskiy et al., 2025), which can perform some of these polyadic tasks in exchange for slower, superquadratic running times. In this work, we define a vast class of generalizations of self-attention, which we call poly-attention mechanisms. Our mechanisms can incorporate arbitrary higher-order (tensor) computations as well as arbitrary relationship structures between the input tokens, and they include the aforementioned alternatives as special cases. We then systematically study their computational complexity and representational strength, including giving new algorithms and matching complexity-theoretic lower bounds on the time complexity of computing the attention matrix exactly as well as approximately, and tightly determining which polyadic tasks they can each perform. Our results give interesting tradeoffs between different desiderata for these mechanisms, including a tight relationship between how expressive a mechanism is, and how large the coefficients in the model may be so that the mechanism can be approximated in almost-linear time. Notably, we give a new attention mechanism which can be computed exactly in quadratic time, and which can perform function composition for any fixed number of functions. Prior mechanisms, even for just composing two functions, could only be computed in superquadratic time, and our new lower bounds show that faster algorithms for them are not possible.
What Layers When: Learning to Skip Compute in LLMs with Residual Gates
Filipe Laitenberger ⋅ Dawid Kopiczko ⋅ Cees G Snoek ⋅ Yuki Asano
We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that compresses the branch’s output before it re-enters the residual stream. During inference we rank tokens by the gate and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining >90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
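A minimal sketch of budgeted, gate-ranked token skipping for a single branch follows. Several details are assumptions made for the example rather than taken from the abstract: the gate is a scalar per token computed from the token's hidden state, the budget is a fraction of tokens to keep, and skipped tokens pass through the residual stream unchanged. The names `gate_values` and `gated_layer` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_values(hidden, w, b):
    """Scalar sigmoid-linear gate per token (assumed form)."""
    return [sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) for h in hidden]


def gated_layer(hidden, branch, w, b, budget):
    """Apply `branch` only to the top `budget` fraction of tokens by gate value.

    Kept tokens get h + gate * branch(h); skipped tokens are copied as-is.
    Returns the new hidden states and the sorted indices of kept tokens.
    """
    gates = gate_values(hidden, w, b)
    n_keep = max(1, math.ceil(budget * len(hidden)))
    ranked = sorted(range(len(hidden)), key=lambda t: gates[t], reverse=True)
    keep = set(ranked[:n_keep])
    out = []
    for t, h in enumerate(hidden):
        if t in keep:
            delta = branch(h)
            out.append([hi + gates[t] * di for hi, di in zip(h, delta)])
        else:
            out.append(list(h))
    return out, sorted(keep)


hidden = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.0]]
out, kept = gated_layer(hidden, lambda h: [1.0, 1.0], w=[2.0, 0.0], b=0.0, budget=0.5)
```

Because the gate is a smooth function of its inputs, the same gate value can scale the branch output during fine-tuning and serve as the ranking score at inference time, which is the property the abstract contrasts with hard routers.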
Samples Are Not Equal: A Sample Selection Approach for Deep Clustering
Zhengxing Jiao ⋅ Yaxin Hou ⋅ Jun Ma ⋅ Yuhang Li ⋅ Ding Ding ⋅ Yuheng Jia ⋅ Hui LIU ⋅ Junhui Hou
Deep clustering has recently achieved remarkable progress across various domains. However, existing clustering methods typically treat all samples equally, neglecting the inherent differences in their feature patterns and learning states. Such redundant learning often drives models to overemphasize simple feature patterns in high-density regions, weakening their ability to capture complex yet diverse ones in low-density regions. To address this issue, we propose a novel plug-in designed to mitigate overfitting to simple and redundant feature patterns while encouraging the learning of more complex yet diverse ones. Specifically, we introduce a density-aware clustering head initialization strategy that adaptively adjusts each sample's contribution to cluster prototypes according to its local density in the feature space. This strategy mitigates the bias towards high-density regions and encourages a more comprehensive attention on medium- and low-density ones. Furthermore, we design a dynamic sample selection strategy that evaluates the learning state of samples based on the feature consistency and pseudo-label stability. By removing sufficiently learned samples and prioritizing unstable ones, this strategy adaptively reallocates training resources, enabling the model to consistently focus on samples that remain under-learned throughout training. Our method can be integrated as a plug-in into a wide range of deep clustering architectures. Extensive experiments on multiple benchmark datasets demonstrate that our method improves clustering accuracy by up to $\mathbf{6.1\%}$ and enhances training efficiency by up to $\mathbf{1.3\times}$. Code is available at https://github.com/notoaudrey/Samples-Are-Not-Equal.
CLIP-FMoE: Scalable CLIP via Fused Mixture-of-Experts with Enforced Specialization
Luong Tran ⋅ Lan-Cuong Nguyen ⋅ Huynh Dang Nguyen ⋅ Nguyen Cong Dat ⋅ Dung D. Le ⋅ Van Nguyen
Mixture-of-Experts (MoE) architectures have emerged as a promising approach for scaling deep learning models while maintaining computational efficiency. However, existing MoE adaptations for Contrastive Language-Image Pre-training (CLIP) models suffer from significant computational overhead during sequential training and degradation of zero-shot capabilities. To address these limitations, we propose CLIP-FMoE, a novel approach that integrates MoE architecture into CLIP fine-tuning. Our method uses Isolated Constrained Contrastive Learning, a pipeline that trains specialized experts on cluster-based data partitions to accelerate expert specialization. Additionally, we introduce a Fusion Gate mechanism to mitigate catastrophic forgetting of pre-trained knowledge. Extensive experiments across multiple benchmarks demonstrate that our approach achieves consistent improvements on downstream tasks while preserving zero-shot capabilities. Furthermore, our method demonstrates robust performance across varying context lengths, making it particularly suitable for diverse real-world applications.
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
Zihao Huang ⋅ Yu Bao ⋅ Qiyang Min ⋅ Siyan Chen ⋅ Ran Guo ⋅ Hongzhi Huang ⋅ Defa Zhu ⋅ Banggu Wu ⋅ Yutao Zeng ⋅ zhou Xun ⋅ Siyuan Qiao
While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
Enhancing Multivariate Time Series Forecasting with Global Temporal Retrieval
Fanpu Cao ⋅ Lu Dai ⋅ Jindong Han ⋅ Hui Xiong
Multivariate time series forecasting (MTSF) plays a vital role in numerous real-world applications, yet existing models remain constrained by their reliance on a limited historical context. This limitation prevents them from effectively capturing global periodic patterns that often span cycles significantly longer than the input horizon—despite such patterns carrying strong predictive signals. Naïve solutions, such as extending the historical window, lead to severe drawbacks, including overfitting, prohibitive computational costs, and redundant information processing. To address these challenges, we introduce the Global Temporal Retriever (GTR), a lightweight and plug-and-play module designed to extend any forecasting model’s temporal awareness beyond the immediate historical context. GTR maintains an adaptive global temporal embedding of the entire cycle and dynamically retrieves and aligns relevant global segments with the input sequence. By jointly modeling local and global dependencies through a 2D convolution and residual fusion, GTR effectively bridges short-term observations with long-term periodicity without altering the host model architecture. Extensive experiments on six real-world datasets demonstrate that GTR consistently delivers state-of-the-art performance across both short-term and long-term forecasting scenarios, while incurring minimal parameter and computational overhead. These results highlight GTR as an efficient and general solution for enhancing global periodicity modeling in MTSF tasks. Code is available at this repository: https://github.com/macovaseas/GTR.
The Lattice Geometry of Neural Network Quantization: A Short Equivalence Proof of GPTQ and Babai's Algorithm
Johann Birnick
We explain how data-driven quantization of a linear unit in a neural network corresponds to solving the closest vector problem for a certain lattice generated by input data. We prove that the GPTQ algorithm is equivalent to Babai's well-known nearest-plane algorithm. We furthermore provide geometric intuition for both algorithms. Lastly, we note the consequences of these results, in particular hinting at the possibility of using lattice basis reduction for improved quantization.
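For readers unfamiliar with the lattice side of this equivalence, here is a small pure-Python sketch of Babai's nearest-plane algorithm for the closest vector problem. It shows only the classical algorithm; the mapping from a linear unit's weights and input data to a lattice basis, and the correspondence with GPTQ, are not reproduced here. Basis vectors are given as rows, and the helper names are choices made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(basis):
    """Orthogonalize the basis rows without normalizing them."""
    ortho = []
    for b in basis:
        v = list(b)
        for o in ortho:
            mu = dot(b, o) / dot(o, o)
            v = [vi - mu * oi for vi, oi in zip(v, o)]
        ortho.append(v)
    return ortho


def babai_nearest_plane(basis, target):
    """Return integer coefficients and the lattice point they generate.

    Works from the last basis vector to the first: at each step, round the
    projection of the residual onto the Gram-Schmidt vector, then subtract
    that multiple of the original basis vector.
    """
    ortho = gram_schmidt(basis)
    residual = list(target)
    coeffs = [0] * len(basis)
    for i in reversed(range(len(basis))):
        c = round(dot(residual, ortho[i]) / dot(ortho[i], ortho[i]))
        coeffs[i] = c
        residual = [r - c * bi for r, bi in zip(residual, basis[i])]
    point = [sum(c * b[j] for c, b in zip(coeffs, basis)) for j in range(len(target))]
    return coeffs, point


coeffs, point = babai_nearest_plane([[2, 0], [1, 2]], [2.9, 2.2])
```

The result is an approximate closest lattice point whose quality depends on how orthogonal the basis is, which is why the abstract points to lattice basis reduction as a possible route to better quantization.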
Let's (not) just put things in Context: Test-time Training for Long-context LLMs
Rachit Bansal ⋅ Aston Zhang ⋅ Rishabh Tiwari ⋅ Lovish Madaan ⋅ Venkata Sai Surya Subramanyam Duvvuri ⋅ Devvrit Khatri ⋅ David Brandfonbrener ⋅ David Alvarez-Melis ⋅ Prajjwal Bhargava ⋅ Mihir Kale ⋅ Samy Jelassi
Progress on training and architecture strategies has enabled LLMs with millions of tokens in context length. However, empirical evidence suggests that such long-context LLMs can consume far more text than they can reliably use. On the other hand, it has been shown that inference-time compute can be used to scale performance of LLMs, often by generating thinking tokens, on challenging tasks involving multi-step reasoning. Through controlled experiments on sandbox long-context tasks, we find that such inference-time strategies show rapidly diminishing returns and fail at long context. We attribute these failures to score dilution, a phenomenon inherent to static self-attention. Further, we show that current inference-time strategies cannot retrieve relevant long-context signals under certain conditions. We propose a simple method that, through targeted gradient updates on the given context, provably overcomes limitations of static self-attention. We find that this shift in how inference-time compute is spent leads to consistently large performance improvements across models and long-context benchmarks. Our method leads to large 12.6 and 14.1 percentage point improvements for Qwen3-4B on average across subsets of LongBench-v2 and ZeroScrolls benchmarks. The takeaway is practical: for long context, a small amount of context-specific training is a better use of inference compute than current inference-time scaling strategies like producing more thinking tokens.
Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database
Weizhi Fei ⋅ Hao Shi ⋅ Jing Xu ⋅ Jingchen Peng ⋅ Jiazheng Li ⋅ Jingzhao Zhang ⋅ Bo Bai ⋅ Wei Han ⋅ zy chen ⋅ Xueyan Niu
Efficiently editing knowledge stored in Large Language Models (LLMs) enables model updates without large-scale training. One promising solution is Locate-and-Edit (L&E), which allows simultaneous modification of a massive number of facts. However, such editing may compromise the general abilities of LLMs and even result in forgetting edited facts when scaling up to thousands of edits. In this paper, we model existing linear L&E methods as querying a Key-Value (KV) database. From this perspective, we then propose NeuralDB, an editing framework that explicitly represents the edited facts as a neural KV database equipped with a non-linear gated retrieval module. With a simple modification of L&E methods, our framework not only significantly extends the capacity of knowledge editing but also eliminates the associated side effects. Comprehensive experiments involving the editing of 10,000 facts were conducted on the ZsRE and CounterFact datasets, using GPT2-XL, GPT-J (6B) and Llama-3 (8B). The results demonstrate that NeuralDB excels in all metrics of editing success while maintaining original performance evaluated by six representative text understanding and generation tasks. Further experiments indicate that NeuralDB maintains its effectiveness even when scaled to 100,000 facts ($\mathbf{50\times}$ more than in prior work).
Massive Editing for Large Language Models Based on Dynamic Weight Generation
Wentao Wan ⋅ Qiqing Lao ⋅ Zhiwei Xie ⋅ Hefeng Wu ⋅ Runnan Lin ⋅ Liang Lin ⋅ Keze Wang
Knowledge Editing (KE) is a field that studies how to modify some knowledge in Large Language Models (LLMs) at a low cost (compared to pre-training). Currently, performing large-scale edits on LLMs while ensuring the Reliability, Generality, and Locality metrics of the edits remains a challenge. This paper proposes a Massive editing approach for LLMs based on dynamic weight Generation (MeG). Our MeG involves attaching a dynamic weight neuron to specific layers of the LLMs and using a diffusion model to conditionally generate the weights of this neuron based on the input query required for the knowledge. This allows a single added dynamic weight neuron to achieve the goal of large-scale knowledge editing. Experiments show that our MeG can significantly improve the performance of large-scale KE in terms of Reliability, Generality, and Locality metrics compared to existing knowledge editing methods, particularly with a high percentage point increase in the absolute value index for the Locality metric, demonstrating the advantages of our proposed method.
Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs
Shangpin Peng ⋅ Weinong Wang ⋅ Zhuotao Tian ⋅ Senqiao Yang ⋅ Xing W ⋅ Haotian Xu ⋅ Chengquan Zhang ⋅ Takashi Isobe ⋅ Baotian Hu ⋅ Min Zhang
Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based methods typically treat all preference pairs equally, overlooking substantial variations in data quality and learning difficulty, which leads to inefficient data utilization and suboptimal performance. To address this limitation, we propose Uni-DPO, a unified dynamic preference optimization framework that jointly considers (a) the inherent quality of preference pairs and (b) the model's evolving performance during training. By adaptively reweighting samples based on both factors, Uni-DPO enables more effective use of preference data and achieves superior performance. Extensive experiments across models and benchmarks demonstrate the effectiveness and generalization of Uni-DPO. On textual tasks, Gemma-2-9B-IT fine-tuned with Uni-DPO surpasses the leading LLM, Claude 3 Opus, by 6.7 points on Arena-Hard. On mathematical and multimodal tasks, Uni-DPO consistently outperforms baseline methods across all benchmarks, providing strong empirical evidence of its effectiveness and robustness.
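The idea of reweighting preference pairs can be illustrated with the standard DPO loss plus a per-pair weight. The sketch below takes the weights as an input and does not attempt to reproduce how Uni-DPO derives them from data quality or training dynamics; the function name `weighted_dpo_loss`, the normalization by the weight sum, and the default `beta` are assumptions for this example.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_dpo_loss(policy_chosen, policy_rejected,
                      ref_chosen, ref_rejected, weights, beta=0.1):
    """Weighted mean of the standard per-pair DPO loss.

    Each argument is a list of sequence log-probabilities, one per pair.
    `weights` is supplied by the caller (hypothetical; any nonnegative
    per-pair importance score with a positive sum).
    """
    total, weight_sum = 0.0, 0.0
    rows = zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected, weights)
    for pc, pr, rc, rr, w in rows:
        margin = beta * ((pc - rc) - (pr - rr))
        total += w * -log_sigmoid(margin)
        weight_sum += w
    return total / weight_sum


loss = weighted_dpo_loss(
    policy_chosen=[-4.0, -6.0], policy_rejected=[-7.0, -5.0],
    ref_chosen=[-5.0, -5.0], ref_rejected=[-5.0, -5.0],
    weights=[0.2, 0.8],
)
```

With uniform weights this reduces to the ordinary DPO objective; non-uniform weights shift the gradient toward the pairs judged more informative.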
TokenSeek: Memory Efficient Fine Tuning via Instance-Aware Token Ditching
Runjia Zeng ⋅ Qifan Wang ⋅ Qiang Guan ⋅ Ruixiang Tang ⋅ Lifu Huang ⋅ Zhenting Wang ⋅ Xueling Zhang ⋅ Cheng Han ⋅ Dongfang Liu
Fine tuning has been regarded as a de facto approach for adapting large language models (LLMs) to downstream tasks, but the high training memory consumption inherited from LLMs makes this process inefficient. Among existing memory efficient approaches, activation-related optimization has proven particularly effective, as activations consistently dominate overall memory consumption. Although prior work offers various activation optimization strategies, their data-agnostic nature ultimately results in ineffective and unstable fine tuning. In this paper, we propose TokenSeek, a universal plugin solution for various transformer-based models through instance-aware token seeking and ditching, achieving significant fine-tuning memory savings (e.g., requiring only 14.8% of the memory on Llama3.2 1B) with on-par or even better performance. Furthermore, our interpretable token seeking process reveals the underlying reasons for its effectiveness, offering valuable insights for future research on token efficiency. Homepage: runjia.tech/iclr_tokenseek.
Textual Equilibrium Propagation for Deep Compound AI Systems
Minghui Chen ⋅ Wenlong Deng ⋅ James Y Zou ⋅ Han Yu ⋅ Xiaoxiao Li
Large language models (LLMs) are increasingly deployed as part of compound AI systems which coordinate multiple modules (e.g., retrievers, tools, verifiers) over long-horizon workflows. Although recent frameworks that propagate textual feedback globally (e.g., TextGrad) make it feasible to optimize such pipelines, we identify two depth-scaling failure modes in long-horizon agentic workflows: 1) exploding textual gradient, where textual feedback grows exponentially with depth, leading to prohibitively long messages and amplified evaluation biases; and 2) vanishing textual gradient, where limited long-context ability causes models to overemphasize recent or early feedback, while compression of lengthy feedback causes downstream messages to lose specificity gradually as they propagate many hops upstream. To mitigate these issues, we introduce Textual Equilibrium Propagation (TEP), a local learning principle inspired by Equilibrium Propagation in energy-based models. TEP includes two phases: 1) a free phase where local LLM critics iteratively refine prompts until reaching equilibrium (no further improvements are suggested); and 2) a nudged phase which applies proximal prompt edits with bounded modification intensity, using task-level objectives that propagate via forward signaling rather than backward feedback chains. This design supports local prompt optimization followed by controlled adaptation toward global goals without the computational burden and signal degradation of global textual backpropagation. Across long-horizon QA benchmarks and a multi-agent tool-use dataset, TEP consistently improves accuracy and efficiency over global propagation methods such as TextGrad, with gains that increase at greater depths, while preserving the practicality of black-box LLM components in deep compound AI systems.
Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin
Enrique Queipo-de-Llano ⋅ Alvaro Arroyo ⋅ Federico Barbero ⋅ Xiaowen Dong ⋅ Michael Bronstein ⋅ Yann LeCun ⋅ Ravid Shwartz-Ziv
Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M to 120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation validates our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.
Dr.LLM: Dynamic Layer Routing in LLMs
Ahmed Heakl ⋅ Martin Gubri ⋅ Salman Khan ⋅ Sangdoo Yun ⋅ Seong Joon Oh
Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.
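The skip, execute, or repeat decision described above can be sketched as a small control loop. This is only an illustration of the control flow: the decisions are passed in as a plain list, whereas in the paper they come from trained per-layer routers, and the function name `run_with_routing` is hypothetical.

```python
def run_with_routing(x, layers, decisions):
    # layers: list of callables, one per transformer block (toy stand-ins here).
    # decisions: one of "skip", "execute", "repeat" per layer (assumed to be given).
    for layer, decision in zip(layers, decisions):
        if decision == "skip":
            continue          # leave the hidden state unchanged
        x = layer(x)          # "execute": apply the block once
        if decision == "repeat":
            x = layer(x)      # "repeat": apply the same block a second time
    return x

# Toy usage with scalar "hidden states" and arithmetic "layers".
toy_layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
result = run_with_routing(1, toy_layers, ["execute", "repeat", "skip"])
```

In this toy run the first layer executes once, the second runs twice, and the third is skipped, so two of the three layers account for three block applications.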
dLLM - Rethinking Generation Beyond Autoregressive Models
Suhas Pai ⋅ Xiaojun Ren
Diffusion large language models (dLLMs) have emerged as a promising alternative to standard autoregressive (AR) Transformers, offering parallel token generation and flexible infilling instead of strict left-to-right decoding. This post walks through how masked discrete diffusion works at a high level: a forward process that randomly masks tokens, a reverse process that iteratively denoises them, and training setups that either start from scratch or adapt existing AR models. We then discuss how diffusion decoding differs from AR decoding, including its strengths for infilling, structured generation, and long-horizon planning, as well as practical challenges around length control, number of denoising steps, and blockwise generation. Finally, we examine both sides of the current evidence: where dLLMs shine in data-constrained regimes and where theoretical and empirical work suggests they can underperform or even collapse back toward autoregressive behavior, ending with some possible hybrid futures that combine diffusion for reasoning and autoregression for generation.
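The forward and reverse processes mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions: the `predict` callable stands in for a trained denoiser, positions to unmask are picked uniformly at random rather than by confidence, and the names `forward_mask` and `reverse_step` are hypothetical.

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng):
    # Forward process: mask each token independently with probability t in [0, 1].
    return [MASK if rng.random() < t else tok for tok in tokens]

def reverse_step(tokens, predict, n_unmask, rng):
    # One denoising step: fill in up to n_unmask masked positions.
    # predict(tokens, i) is an assumed stand-in for the model's prediction at position i.
    masked = [i for i, tok in enumerate(tokens) if tok == MASK]
    chosen = rng.sample(masked, min(n_unmask, len(masked)))
    out = list(tokens)
    for i in chosen:
        out[i] = predict(tokens, i)
    return out

def generate(length, predict, n_unmask, rng):
    # Start fully masked and denoise until no masked positions remain.
    tokens = [MASK] * length
    steps = 0
    while MASK in tokens:
        tokens = reverse_step(tokens, predict, n_unmask, rng)
        steps += 1
    return tokens, steps
```

Unmasking several positions per step is what allows parallel generation: with `n_unmask` positions filled per step, a sequence of length `length` needs roughly `length / n_unmask` denoising steps instead of one step per token.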
Towards Improved Sentence Representations using Token Graphs
Krishna Sri Ipsit Mantri ⋅ Carola-Bibiane Schönlieb ⋅ Zorah Lähner ⋅ Moshe Eliasof
Obtaining a single-vector representation from a Large Language Model's (LLM) token-level outputs is a critical step for nearly all sentence-level tasks. However, standard pooling methods like mean or max aggregation treat tokens as an independent set, discarding the rich relational structure captured by the model's self-attention layers and making them susceptible to signal dilution. To address this, we introduce GLOT, a lightweight, structure-aware pooling module that reframes pooling as relational learning followed by aggregation. Operating on the outputs of a frozen LLM, GLOT first constructs a latent token-similarity graph, then refines token representations with a graph neural network, and finally aggregates them using a readout layer. Experimentally, our approach is remarkably robust and efficient: on a diagnostic stress test where 90% of tokens are random distractors, GLOT maintains over 97% accuracy while baseline methods collapse. Furthermore, it is competitive with state-of-the-art techniques on benchmarks like GLUE and MTEB with 20x fewer trainable parameters and speeds up the training time by over 100x compared with parameter-efficient fine-tuning methods. Supported by a theoretical analysis of its expressive power, our work shows that learning over token graphs is a powerful paradigm for the efficient adaptation of frozen LLMs. Our code is published at https://github.com/ipsitmantri/GLOT.
In Context Semi-Supervised Learning
Jiashuo Fan ⋅ Paul Rosu ⋅ Aaron Wang ⋅ Lawrence Carin ⋅ Xiang Cheng
There has been significant recent interest in understanding the capacity of Transformers for in-context learning (ICL), yet most theory focuses on supervised settings with explicitly labeled pairs. In practice, Transformers often perform well even when labels are sparse or absent, suggesting crucial structure within unlabeled contextual demonstrations. We introduce and study in-context semi-supervised learning (IC-SSL), where a small set of labeled examples is accompanied by many unlabeled points, and show that Transformers can leverage the unlabeled context to learn a robust, context-dependent representation. This representation enables accurate predictions and markedly improves performance in low-label regimes, offering foundational insights into how Transformers exploit unlabeled context for representation learning within the ICL framework. Our code is available at https://github.com/Jason-fan20/ICL_Semi.
Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models
Junhyuck Kim ⋅ Ethan Ewer ⋅ Taehong Moon ⋅ Jongho Park ⋅ Dimitris Papailiopoulos
While 4-bit quantization has emerged as a memory-optimal choice for non-reasoning models and zero-shot tasks across scales, we show that this universal prescription fails for reasoning models, where KV cache rather than model size can dominate memory. Through systematic experiments on mathematical, code generation, and knowledge-intensive reasoning tasks, we find a scale-dependent trade-off: models with an effective size below 8-bit 4B parameters achieve better accuracy by allocating memory to larger weights, rather than longer generation, while larger models benefit from the opposite strategy. This scale threshold also determines when parallel scaling becomes memory-efficient and whether KV cache eviction outperforms KV quantization. Our findings show that memory optimization for LLMs cannot be scale-agnostic, while providing principled guidelines: for small reasoning models, prioritize model capacity over test-time compute, while for large ones, maximize test-time compute. Our results suggest that optimizing reasoning models for deployment requires fundamentally different strategies than those established for non-reasoning ones.
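A back-of-the-envelope calculation shows why the KV cache can rival the weights in memory for long generations. The model shape used here (32 layers, 8 KV heads, head dimension 128) and the 32K-token generation length are hypothetical values chosen for illustration and are not taken from the paper.

```python
def weight_bytes(n_params, bits):
    # Memory for model weights stored at the given bit width.
    return n_params * bits // 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits=16, batch=1):
    # The factor 2 accounts for storing both keys and values at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * batch * bits // 8

# Hypothetical 4B-parameter model with 8-bit weights.
weights = weight_bytes(4_000_000_000, bits=8)

# Hypothetical architecture generating 32K tokens with a 16-bit KV cache.
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=32_768)

print(f"weights: {weights / 2**30:.2f} GiB, KV cache: {kv / 2**30:.2f} GiB")
```

Under these assumed numbers the cache for a single long generation is about as large as the weights themselves, so a fixed memory budget forces a choice between a larger model and longer or more parallel generations.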
Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression
Yifei Zuo ⋅ Yutong Yin ⋅ Zhichen Zeng ⋅ Ang Li ⋅ Banghua Zhu ⋅ Zhaoran Wang
Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at greater computational cost, has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $\Theta(n^2d)$ and $\Theta(nd^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experimental results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models.
RACE Attention: A Strictly Linear-Time Attention for Long-Sequence Training
Sahil Joshi ⋅ Agniva Chowdhury ⋅ Amar Kanakamedala ⋅ Ekam Singh ⋅ Evan Tu ⋅ Anshumali Shrivastava
Softmax Attention has a quadratic time complexity in sequence length, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2/3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward–backward pass of a single attention layer once the context exceeds $\sim 4$ million tokens on an NVIDIA GH200 (96 GB). We introduce Repeated Arrays-of-Count Estimators (RACE) Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding size. RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH), avoiding construction of the full attention matrix. Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines up to $64$K sequence length while reducing wall-clock time and memory usage. In addition, we conduct a controlled scaling study on a single attention layer and demonstrate processing of up to 12 million tokens on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon® Gold 5220R CPU in a single forward–backward pass, which is well beyond the capabilities of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today’s hardware.
Hyper-SET: Designing Transformers via Hyperspherical Energy Minimization
Yunzhe Hu ⋅ Difan Zou ⋅ Dong Xu
Transformer-based models have achieved remarkable success, but their core components, Transformer layers, are largely heuristics-driven and engineered from the bottom up, calling for a prototypical model with high interpretability and practical competence. To this end, we conceptualize a principled, top-down approach grounded in energy-based interpretation. Specifically, we formalize token dynamics as a joint maximum likelihood estimation on the hypersphere, featuring two properties: semantic alignment in the high-dimensional space and distributional uniformity in the low-dimensional space. By quantifying them with extended Hopfield energy functions, we instantiate this idea as a constrained energy minimization problem, which enables designs of symmetric attention and feedforward modules with RMS normalization. We further present Hyper-Spherical Energy Transformer (Hyper-SET), a recurrent-depth alternative to vanilla Transformers naturally emerging from iterative energy optimization on the hypersphere. With shared parameters across layers, Hyper-SET can scale to arbitrary depth with fewer parameters. Theoretically grounded and compact, it achieves competitive or superior performance across diverse tasks, including Sudoku solving, image classification, and masked image modeling. We also design novel variations under the proposed general principle, such as linear attention and gated feedforward layer, and showcase its scalability with depth-wise LoRA. Our results highlight Hyper-SET as a step toward interpretable and principled Transformer design. Code is available at https://github.com/huyunzhe/hyper-set.
Draft-based Approximate Inference for LLMs
Kevin Galim ⋅ Ethan Ewer ⋅ Wonjun Kang ⋅ Minjae Lee ⋅ Hyung Koo ⋅ Kangwook Lee
Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
Learning of Population Dynamics: Inverse Optimization Meets JKO Scheme
Mikhail Persiianov ⋅ Jiawei Chen ⋅ Petr Mokrov ⋅ Alexander Tyurin ⋅ Evgeny Burnaev ⋅ Aleksandr Korotin
Learning population dynamics involves recovering the underlying process that governs particle evolution, given evolutionary snapshots of samples at discrete time points. Recent methods frame this as an energy minimization problem in probability space and leverage the celebrated JKO scheme for efficient time discretization. In this work, we introduce iJKOnet, an approach that combines the JKO framework with inverse optimization techniques to learn population dynamics. Our method relies on a conventional end-to-end adversarial training procedure and does not require restrictive architectural choices, e.g., input-convex neural networks. We establish theoretical guarantees for our methodology and demonstrate improved performance over prior JKO-based methods.
Logit‑KL Flow Matching: Non‑Autoregressive Text Generation via Sampling‑Hybrid Inference
Egor Sevriugov ⋅ Nikita Dragunov ⋅ Anton Razzhigaev ⋅ Andrey Kuznetsov ⋅ Ivan Oseledets
Non-autoregressive (NAR) language models offer notable efficiency in text generation by circumventing the sequential bottleneck of autoregressive decoding. However, accurately modeling dependencies in discrete sequences remains challenging in this paradigm. In this work, we advance the field of NAR generation by applying conditional flow matching (CFM) methods grounded in geometrically principled interpolation, specifically leveraging Kullback-Leibler (KL) divergence geodesics, which correspond to linear interpolation in logit space. We rigorously establish that maximizing conditional likelihood in this setting precisely recovers the flow matching velocity field, supplying the theoretical justification for this approach in sequence modeling. To address practical performance gaps of basic inference, we propose a novel empirical sampling strategy that iteratively denoises and re-noises, along with a hybrid scheme that integrates our sampling method with the basic procedure. Across unconditional and conditional text and code infilling, the approach improves perplexity and downstream metrics over prior NAR baselines under matched settings.
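The interpolation path can be made concrete with a small numerical check. Interpolating linearly between two logit vectors and applying softmax gives the same distribution as the normalized geometric mixture of the two endpoint distributions, with unnormalized mass $p_0^{1-t} p_1^{t}$ per category. The sketch below verifies this identity only; it does not implement the training objective or the sampling schemes, and the function names are assumptions.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def logit_interpolate(logits0, logits1, t):
    # Linear interpolation in logit space, then map back to a distribution.
    mixed = [(1 - t) * a + t * b for a, b in zip(logits0, logits1)]
    return softmax(mixed)

def geometric_mixture(p0, p1, t):
    # Normalized geometric mixture of two distributions with full support.
    unnorm = [a ** (1 - t) * b ** t for a, b in zip(p0, p1)]
    s = sum(unnorm)
    return [u / s for u in unnorm]

# Arbitrary example logits over a four-token vocabulary.
l0 = [2.0, 0.5, -1.0, 0.0]
l1 = [-0.5, 1.5, 0.3, 2.2]
path_point = logit_interpolate(l0, l1, 0.3)
mixture_point = geometric_mixture(softmax(l0), softmax(l1), 0.3)
```

The two results agree up to floating-point error, and the endpoints `t = 0` and `t = 1` recover the two original distributions.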
The Spacetime of Diffusion Models: An Information Geometry Perspective
Rafał Karczewski ⋅ Markus Heinonen ⋅ Alison Pouplin ⋅ Søren Hauberg ⋅ Vikas Garg
We present a novel geometric perspective on the latent space of diffusion models. We first show that the standard pullback approach, utilizing the deterministic probability flow ODE decoder, is fundamentally flawed. It provably forces geodesics to decode as straight segments in data space, effectively ignoring any intrinsic data geometry beyond the ambient Euclidean space. Complementing this view, diffusion also admits a stochastic decoder via the reverse SDE, which enables an information geometric treatment with the Fisher-Rao metric. However, a choice of $\mathbf{x}_T$ as the latent representation collapses this metric due to memorylessness. We address this by introducing a latent spacetime $\mathbf{z}=(\mathbf{x}_t,t)$ that indexes the family of denoising distributions $p(\mathbf{x}_0 | \mathbf{x}_t)$ across all noise scales, yielding a nontrivial geometric structure. We prove these distributions form an exponential family and derive simulation-free estimators for curve lengths, enabling efficient geodesic computation. The resulting structure induces a principled Diffusion Edit Distance, where geodesics trace minimal sequences of noise and denoise edits between data. We also demonstrate benefits for transition path sampling in molecular systems, including constrained variants such as low-variance transitions and region avoidance. Code is available at: https://github.com/rafalkarczewski/spacetime-geometry.
Discrete Bayesian Sample Inference for Graph Generation
Ole Petersen ⋅ Marcel Kollovieh ⋅ Marten Lienen ⋅ Stephan Günnemann
Generating graph-structured data is crucial in applications such as molecular generation, knowledge graphs, and network analysis. However, their discrete, unordered nature makes them difficult for traditional generative models, leading to the rise of discrete diffusion and flow matching models. In this work, we introduce GraphBSI, a novel one-shot graph generative model based on Bayesian Sample Inference (BSI). Instead of evolving samples directly, GraphBSI iteratively refines a belief over graphs in the continuous space of distribution parameters, naturally handling discrete structures. Further, we state BSI as a stochastic differential equation (SDE) and derive a noise-controlled family of SDEs that preserves the marginal distributions via an approximation of the score function. Our theoretical analysis further reveals the connection to Bayesian Flow Networks and Diffusion models. Finally, in our empirical evaluation, we demonstrate state-of-the-art performance on molecular and synthetic graph generation, outperforming existing one-shot graph generative models on the standard benchmarks Moses and GuacaMol.
TEDM: Time Series Forecasting with Elucidated Diffusion Models
Edgardo Solano Carrillo ⋅ Sreerag Vadakkemeppully Naveenachandran ⋅ Julia Niebling
Score-based generative modeling through differential equations has driven breakthroughs in high-fidelity image synthesis, offering modular model design and efficient sampling. However, this success has not yet been widely translated to time series forecasting. This gap stems from the sequential nature of time series, in contrast to the unordered structure of images. Here, we extend the theoretical formulation used for images to explicitly address sequential structures. We propose a diffusion-based forecasting framework (TEDM) that adapts score estimation to temporal settings and elucidates its design space. Such a design allows empirical computation of noise and signal scaling directly from data, avoiding external schedules. Notably, this reduces sampling complexity to linear in the forecast horizon. Without elaborate preprocessing, TEDM sets new state-of-the-art results on multiple forecasting benchmarks. These results illustrate the growing potential of diffusion models beyond vision. TEDM generates low-latency forecasts using a lightweight architecture, making it ideal for real-time deployment.
Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure
Boshi Wang ⋅ Huan Sun
Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the Reversal Curse, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing binding problem in cognitive science, neuroscience and AI. Specifically, we hypothesize two primary causes of the Reversal Curse stemming from transformers' limitations in conceptual binding: the inconsistency and entanglements of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking, and moreover, generalization could be further improved by incorporating special memory layers that support disentangled concept representations. Our research opens up the broader fundamental challenge of designing models capable of learning systematic conceptual binding with less human scaffolding.
What Exactly Does Guidance Do in Masked Discrete Diffusion Models
Ye He ⋅ Kevin Rojas ⋅ Molei Tao
Masked discrete diffusion models have been gaining popularity recently, and classifier-free guidance, just like its continuous counterpart, has been proposed to enable efficacious conditional generation by discrete diffusion. To quantify the precise effect of discrete guidance, this article considers masked discrete diffusion with arbitrary data distribution in low dimension, so that the distribution that guided masked discrete diffusion samples from, as well as the sampling dynamics, can be analytically and exactly quantified and interpreted. When the full data distribution is a mixture over classes and the goal is to sample from a specific class, guidance amplifies class-specific regions while suppressing regions shared with other classes. This effect depends on the guidance strength $w$ and induces distinct covariance structures in the sampled distribution. Notably, we observe quantitatively different behaviors in $1$D and $2$D. We also show that for large $w$, the decay rate of the total variation ($\text{TV}$) along the reverse dynamics is double-exponential in $w$ for both $1$D and $2$D. These findings highlight the role of guidance, not just in shaping the output distribution, but also in controlling the dynamics of the sampling trajectory. Our theoretical analysis is supported by experiments that illustrate the geometric effects of guidance and its impact on convergence.
The possibility of LLM self-awareness and even sentience is gaining increasing public attention and has major safety and policy implications, but the science of measuring them is still in a nascent state. Here we introduce a novel methodology for quantitatively evaluating metacognitive abilities in LLMs. Taking inspiration from research on metacognition in nonhuman animals, our approach eschews model self-reports and instead tests to what degree models can strategically deploy knowledge of internal states. Using two experimental paradigms, we demonstrate that frontier LLMs introduced since early 2024 show increasingly strong evidence of certain metacognitive abilities, specifically the ability to assess and utilize their own confidence in their ability to answer factual and reasoning questions correctly and the ability to anticipate what answers they would give and utilize that information appropriately. We buttress these behavioral findings with an analysis of the token probabilities returned by the models, which suggests the presence of an upstream internal signal that could provide the basis for metacognition. We further find that these abilities 1) are limited in resolution, 2) emerge in context-dependent manners, and 3) seem to be qualitatively different from those of humans. We also report intriguing differences across models of similar capabilities, suggesting that LLM post-training may have a role in developing metacognitive abilities.
DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights
Saumya Gupta ⋅ Scott Biggs ⋅ Moritz Laber ⋅ Zohair Shafi ⋅ Robin Walters ⋅ Ayan Paul
Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.
Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy
Zeju LI ⋅ Jianyuan Zhong ⋅ Ziyang Zheng ⋅ Xiangyu Wen ⋅ Zhijian Xu ⋅ Yingying Cheng ⋅ Fan Zhang ⋅ Qiang Xu
Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned without significant degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures.
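The pruning idea can be sketched as follows. The abstract does not give the exact definition of step entropy, so this sketch assumes it is the sum of token-level Shannon entropies within a step; replacing pruned steps with a literal `[SKIP]` marker and the function names are likewise assumptions made for illustration.

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of one next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def step_entropy(step_token_dists):
    # ASSUMPTION: a step's entropy is the sum of its token-level entropies.
    return sum(token_entropy(d) for d in step_token_dists)

def prune_low_entropy_steps(steps, entropies, ratio=0.8, skip="[SKIP]"):
    # Replace the lowest-entropy fraction of steps with a skip marker.
    n_prune = int(len(steps) * ratio)
    order = sorted(range(len(steps)), key=lambda i: entropies[i])
    to_prune = set(order[:n_prune])
    return [skip if i in to_prune else s for i, s in enumerate(steps)]

# Toy usage with made-up steps and entropy values.
toy_steps = ["restate problem", "set up equation", "repeat given value", "solve for x"]
toy_entropies = [0.1, 2.0, 0.2, 1.5]
compressed = prune_low_entropy_steps(toy_steps, toy_entropies, ratio=0.5)
```

In the toy run the two lowest-entropy steps are replaced, while the higher-entropy steps that carry the actual derivation are kept.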
COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Kartik Sharma ⋅ Rakshit Trivedi
Activation steering methods enable inference-time control of large language model (LLM) behavior without retraining, but current approaches face a fundamental trade-off: sample-efficient methods suboptimally capture steering signals from labeled examples, while methods that better extract these signals require hundreds to thousands of examples. We introduce COLD-Steer, a training-free framework that steers LLM activations by approximating the representational changes that would result from gradient descent on in-context examples. Our key insight is that the effect of fine-tuning on a small set of examples can be efficiently approximated at inference time without actual parameter updates. We formalize this through two complementary approaches: (i) a unit kernel approximation method that updates the activations directly using gradients with respect to them, normalized across examples, and (ii) a finite-difference approximation requiring only two forward passes regardless of example count. Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples compared to the best baseline. COLD-Steer enables real-time adaptation to new steering objectives and facilitates accommodating diverse perspectives without extensive demonstration data, which we validate through our experiments on pluralistic alignment tasks. Our framework opens new possibilities for adaptive, context-aware model control that can flexibly address varying loss-driven human preferences through principled approximation of learning dynamics rather than specialized training procedures.
KV Cache Transform Coding for Compact Storage in LLM Inference
Konrad Staniszewski ⋅ Adrian Lancucki
Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20x compression while maintaining reasoning and long-context accuracy, and 40x or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, GSM8K, LiveCodeBench, LongBench, MATH-500, MMLU, Qasper and RULER. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
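The three stages named in this abstract (decorrelating transform, quantization, entropy coding) follow the classical transform-coding recipe, which the toy sketch below illustrates. It is not KVTC itself: it works on hypothetical 2-D vectors rather than real KV tensors, uses a closed-form 2-D PCA, a single fixed quantization step instead of adaptive quantization, and only estimates the entropy-coded size from the empirical symbol entropy. All names and constants are assumptions.

```python
import math
from collections import Counter

def pca_2d(points):
    """Return the mean and the principal-axis angle of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    return (mx, my), theta

def to_components(points, mean, theta):
    """Center the points and rotate them onto the principal axes."""
    c, s = math.cos(theta), math.sin(theta)
    return [((x - mean[0]) * c + (y - mean[1]) * s,
             -(x - mean[0]) * s + (y - mean[1]) * c) for x, y in points]

def from_components(comps, mean, theta):
    """Inverse of to_components."""
    c, s = math.cos(theta), math.sin(theta)
    return [(a * c - b * s + mean[0], a * s + b * c + mean[1]) for a, b in comps]

def entropy_bits(symbols):
    """Empirical entropy in bits per symbol, a lower bound for an entropy coder."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(k / n * math.log2(k / n) for k in counts.values())

# Toy "cache": 40 two-dimensional vectors with strongly correlated coordinates.
points = [(0.25 * i, 0.2 * i + 0.1 * ((7 * i) % 5 - 2)) for i in range(40)]
mean, theta = pca_2d(points)
comps = to_components(points, mean, theta)

step = 0.5  # hypothetical uniform quantization step for both components
codes = [(round(a / step), round(b / step)) for a, b in comps]
decoded = from_components([(qa * step, qb * step) for qa, qb in codes], mean, theta)

max_err = max(math.dist(p, q) for p, q in zip(points, decoded))
bits_major = entropy_bits([qa for qa, _ in codes])
bits_minor = entropy_bits([qb for _, qb in codes])
print(max_err, bits_major, bits_minor)
```

After the rotation almost all of the variation sits in the first component, so the second component quantizes to very few distinct symbols and costs few bits; that concentration is the redundancy a transform coder exploits.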
FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Haozheng Luo ⋅ Zhuolin Jiang ⋅ Md Zahid Hasan ⋅ Yan Chen ⋅ Soumalya Sarkar
We propose **FROST**, an attention-aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of *reasoning outliers* and design an attention-based mechanism to remove them. Theoretically, FROST preserves and enhances the model’s reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-oss-20B), outperforming state-of-the-art methods such as TALE and ThinkLess. Notably, FROST achieves an average **69.68%** reduction in token usage and a **26.70%** improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm $\|\mathbf{x}\|_{\infty}$ by **15.97%** and the average kurtosis by **91.09%** compared to the base model.
InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning
Matthew Yang ⋅ Hao Bai ⋅ Ian Wu ⋅ Gene Yang ⋅ Amrith Setlur ⋅ Aviral Kumar
Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.
EAMET: Robust Massive Model Editing via Embedding Alignment Optimization
Yanbo Dai ⋅ Zhenlan Ji ⋅ Zongjie Li ⋅ Shuai Wang
Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics. Their robustness is also limited in context-rich settings or when editing multiple facts of the same subject simultaneously. We attribute these failures to the embedding misalignment among knowledge items, which undermines editing reliability at scale. To address this, we propose EAMET (Embedding Alignment Model Editing in Transformers), which aligns the space of key and residual embeddings. Extensive experiments across six LLMs and three datasets demonstrate that EAMET consistently outperforms existing methods, achieving about 90% editing efficacy when editing 10k facts.
Flow-based Conformal Prediction for Multi-dimensional Time Series
Junghwan Lee ⋅ Chen Xu ⋅ Yao Xie
Time series prediction underpins a broad range of downstream tasks across many scientific domains. Recent advances and increasing adoption of black-box machine learning models for time series prediction highlight the critical need for uncertainty quantification. While conformal prediction has gained attention as a reliable uncertainty quantification method, conformal prediction for time series faces two key challenges: (1) leveraging correlations in observations and non-conformity scores to overcome the exchangeability assumption, and (2) constructing prediction sets for multi-dimensional outcomes. To address these challenges, we propose a novel conformal prediction method for time series using flow with classifier-free guidance. We provide coverage guarantees by establishing exact non-asymptotic marginal coverage and a finite-sample bound on conditional coverage for the proposed method. Evaluations on real-world time series datasets demonstrate that our method constructs significantly smaller prediction sets than existing conformal prediction methods while maintaining target coverage.
Score-based diffusion models have achieved remarkable progress in various domains with the ability to generate new data samples that do not exist in the training set. In this work, we study the hypothesis that such creativity arises from an interpolation effect caused by a smoothing of the empirical score function. Focusing on settings where the training set lies uniformly in a one-dimensional subspace, we probe the interplay between score smoothing and the denoising dynamics with analytical solutions and numerical experiments. In particular, we demonstrate how a smoothed score function can lead to the generation of samples that interpolate among the training data within their subspace while avoiding a full memorization. Moreover, we present theoretical and empirical evidence that learning score functions with neural networks - either with or without explicit regularization - can indeed achieve a similar effect as score smoothing, including when the data belongs to simple nonlinear manifolds.
Universal Multi-Domain Translation via Diffusion Routers
Duc Kieu ⋅ Kien Do ⋅ Tuan Hoang ⋅ Thao Le ⋅ Tung Kieu ⋅ Dang Nguyen ⋅ Thin Nguyen
Multi-domain translation (MDT) aims to learn translations between multiple domains, yet existing approaches either require fully aligned tuples or can only handle domain pairs seen in training, limiting their practicality and excluding many cross-domain mappings. We introduce universal MDT (UMDT), a generalization of MDT that seeks to translate between any pair of $K$ domains using only $K-1$ paired datasets with a central domain. To tackle this problem, we propose Diffusion Router (DR), a unified diffusion-based framework that models all central$\leftrightarrow$non-central translations with a single noise predictor conditioned on the source and target domain labels. DR enables indirect non-central translations by routing through the central domain. We further introduce a novel scalable learning strategy with a variational-bound objective and an efficient Tweedie refinement procedure to support direct non-central mappings. Through evaluation on three large-scale UMDT benchmarks, DR achieves state-of-the-art results for both indirect and direct translations, while lowering sampling cost and unlocking novel tasks such as sketch$\leftrightarrow$segmentation. These results establish DR as a scalable and versatile framework for universal translation across multiple domains.
Structured Flow Autoencoders: Learning Structured Probabilistic Representations with Flow Matching
Yidan Xu ⋅ Yixin Wang ⋅ XuanLong Nguyen
Flow matching is a powerful approach for high-fidelity density estimation, but it often fails to capture the latent structure of complex data. Probabilistic models like variational autoencoders (VAEs), on the other hand, learn structured representations but underperform in sample quality. We propose Structured Flow Autoencoders (SFA), a family of probabilistic models that augments graphical models with conditional continuous normalizing flow (CNF) likelihoods, enabling flow-matching-based structured representation learning. At the core of SFA is a novel flow matching objective that explicitly accounts for latent variables, allowing joint learning of the CNF likelihood and posterior. SFA applies broadly to graphical models with continuous and mixture latents, as well as latent dynamical systems. Empirical studies across image, video, and RNA-seq data show that SFA consistently outperforms VAEs and their structured extensions in generation quality, representation utility, and scalability to large datasets. Compared to generative models like latent flow matching (LatentFM), SFA also produces more diverse samples, suggesting better coverage of the data distribution.
Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models
Mengni Jia ⋅ Mengyu Zhou ⋅ Yihao Liu ⋅ xiaoxi jiang ⋅ guanjunjiang
Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from **inherently** much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. To date, there has been no theoretical explanation or systematic solution. In this paper, we derive **the first decomposition** of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, whereas ARMs are affected only by (C). This cleanly explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a **Pareto-optimal** $t$-sampler that minimizes training variance by sampling harder $t$ values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce (A). Experiments show that, compared to standard MDM training, our methods improve accuracy by **7–8%** on complex reasoning tasks, while simultaneously reducing run-to-run variability to **near ARM levels**, substantially narrowing the gap with strong ARM baselines; in most settings, even the best baseline method runs remain below the worst run of our method.
Contact Wasserstein Geodesics for Non-Conservative Schrödinger Bridges
Andrea Testa ⋅ Søren Hauberg ⋅ Tamim Asfour ⋅ Leonel Rozo
The Schrödinger Bridge provides a principled framework for modeling stochastic processes between distributions; however, existing methods are limited by energy-conservation assumptions, which constrain the bridge's shape and prevent it from modeling varying-energy phenomena. To overcome this, we introduce the non-conservative generalized Schrödinger bridge (NCGSB), a novel, energy-varying reformulation based on contact Hamiltonian mechanics. By allowing energy to change over time, the NCGSB covers a broader class of real-world stochastic processes, capturing richer and more faithful intermediate dynamics. By parameterizing the Wasserstein manifold, we lift the bridge problem to a tractable geodesic computation in a finite-dimensional space. Unlike computationally expensive iterative solutions, our contact Wasserstein geodesic (CWG) is naturally implemented via a ResNet architecture and relies on a non-iterative solver with near-linear complexity. Furthermore, CWG supports guided generation by modulating a task-specific distance metric. We validate our framework on tasks including manifold navigation, molecular dynamics predictions, and image generation, demonstrating its practical benefits and versatility.
There is No VAE: End-to-End Pixel-Space Generative Modeling via Self-Supervised Pre-Training
jiachen lei ⋅ Keli Liu ⋅ Julius Berner ⋅ Y HoiM ⋅ Hongkai Zheng ⋅ Jiahong Wu ⋅ Xiangxiang Chu
Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our framework achieves state-of-the-art (SOTA) performance on ImageNet. Specifically, our diffusion model reaches an FID of 1.58 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods and VAE-based counterparts by a large margin in both generation quality and training efficiency. In a direct comparison, our model significantly outperforms DiT while using only around 30% of its training compute. Furthermore, our consistency model achieves an impressive FID of 8.82 on ImageNet-256, significantly outperforming its latent-space counterparts. This marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models. Our code is available at https://github.com/AMAP-ML/EPG.
Reasoning Scaffolding: Distilling the Flow of Thought from LLMs
Xiangyu Wen ⋅ Junhua Huang ⋅ Zeju LI ⋅ Min Li ⋅ Jianyuan Zhong ⋅ Zhijian Xu ⋅ Mingxuan Yuan ⋅ Yongxiang Huang ⋅ Qiang Xu
The prevailing approach to distilling reasoning from Large Language Models (LLMs)—behavioral cloning from textual rationales—is fundamentally limited. It teaches Small Language Models (SLMs) to mimic surface-level patterns rather than the underlying algorithmic structure of thought, resulting in a critical lack of logical robustness. We argue that instead of cloning text, distillation should transfer this algorithmic structure directly. We introduce Reasoning Scaffolding, a framework that reframes reasoning as a structured generation process. Our method first abstracts the teacher's thought process into a sequence of discrete, interpretable semantic signals (e.g., Contrast, Addition) that act as a scaffold. The student model is then trained via a multi-task objective to both (1) predict the next semantic signal, anticipating the reasoning flow, and (2) generate the corresponding step, conditioned on that signal. This multi-task scheme acts as a powerful regularizer, compelling the student to internalize the computational patterns of coherent reasoning. On a suite of challenging reasoning benchmarks, our method significantly outperforms state-of-the-art distillation in both accuracy and logical consistency, providing a path towards creating smaller models that are genuine reasoners, not just fluent mimics.
Autoregressive-based Progressive Coding for Ultra-Low Bitrate Image Compression
Ziyuan Zhang ⋅ Yichong Xia ⋅ Bin Chen ⋅ Tianwei Zhang ⋅ Hao Wang ⋅ Han Qiu
Generative models have demonstrated significant results in ultra-low bitrate image compression, owing to their powerful capabilities for content generation and texture completion. Existing works, primarily based on diffusion models, still face challenges such as limited bitrate adaptability and high computational complexity for encoding and decoding. Inspired by the success of the Visual AutoRegressive (VAR) model, we introduce AutoRegressive-based Progressive Coding (ARPC), a progressive framework for ultra-low bitrate image compression built on a next-scale-prediction visual autoregressive model. Based on a multi-scale residual vector quantizer, ARPC efficiently encodes the image into multi-scale discrete token maps and controls the bitrate by selecting different scales for transmission. For decompression, ARPC leverages the prior knowledge inherent in the visual autoregressive model to predict the unreceived scales, which naturally coincides with the autoregressive generation process. To further increase the compression ratio, we use the VAR model as a probability estimator for lossless entropy coding and propose a group-masked bitwise multi-scale residual quantizer to adaptively allocate bits across scales. Extensive experiments show that ARPC achieves state-of-the-art perceptual fidelity at ultra-low bitrates and high decompression efficiency compared with existing diffusion-based methods.
Train Once, Answer All: Many Pretraining Experiments for the Cost of One
Sebastian Bordt ⋅ Martin Pawelczyk
Recent work has demonstrated that controlled pretraining experiments are a powerful tool for studying the relationship between training data and large language model (LLM) behavior. However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose a new approach where multiple experiments are conducted simultaneously during a single training run. We validate our approach by performing ten experiments while training on 210B tokens, with models of up to 2.7B parameters. Although models are trained only once, we can replicate the results of multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until a model acquires a particular piece of knowledge. Remarkably, the influence of the experiments on the model's training dynamics and overall performance is minimal. However, interactions between experiments may act as a confounder in our approach. We propose continual pretraining dependence testing (CPDT), a novel technique to test for interactions with continual pretraining experiments, finding them to be negligible in our setup. Overall, our results suggest that performing multiple pretraining experiments within a single training run can enable rigorous scientific experimentation with large models on a compute budget.
Safety-Guided Flow (SGF): A Unified Framework for Negative Guidance in Safe Generation
Mingyu Kim ⋅ Young-Heon Kim ⋅ Mijung Park
Safety mechanisms for diffusion and flow models have recently been developed along two distinct paths. In robot planning, control barrier functions are employed to guide generative trajectories away from obstacles at every denoising step by explicitly imposing geometric constraints. In parallel, recent data-driven, negative guidance approaches have been shown to suppress harmful content and promote diversity in generated samples. However, they rely on heuristics without clearly stating when safety guidance is actually necessary. In this paper, we first introduce a unified probabilistic framework using a Maximum Mean Discrepancy (MMD) potential for image generation tasks that recasts both Shielded Diffusion and Safe Denoiser as instances of our energy-based negative guidance against unsafe data samples. Furthermore, we leverage control-barrier functions analysis to justify the existence of a critical time window in which negative guidance must be strong; outside of this window, the guidance should decay to zero to ensure safe and high-quality generation. We evaluate our unified framework on several realistic safe generation scenarios, confirming that negative guidance should be applied in the early stages of the denoising process for successful safe generation.
SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism
Yuhao Shen ⋅ Junyi Shen ⋅ Quan Kong ⋅ Tianyu Liu ⋅ Yao Lu ⋅ Cong Wang
Speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small, efficient draft model to propose draft tokens in advance, and subsequently validating them in parallel with the large target model. However, existing SD methods remain fundamentally constrained by their serialized execution, which inevitably causes mutual waiting bubbles between the draft and target models. To address this critical challenge, we draw inspiration from sophisticated branch prediction mechanisms in modern processors and propose a novel framework, SpecBranch, to fully unlock branch parallelism in SD. Specifically, we first conduct an in-depth analysis of the potential of branch parallelism in SD, and recognize that the key challenge lies in the intricate trade-offs between parallelization and token rollback. Based on this analysis, we introduce parallel speculative branches to preemptively hedge against likely rejections. Meanwhile, to significantly enhance parallelism, we jointly orchestrate adaptive draft lengths with a hybrid combination of the implicit draft model confidence and explicit reusing of target model features. Extensive experiments conducted across various models and benchmarks show that SpecBranch achieves impressive speedups of 1.8× to 4.5× over standard autoregressive decoding and reduces rollback tokens by 50% for poorly aligned models, while maintaining an identical sampling distribution. Our code is available at https://github.com/Sylvan820/Specbranch.
Dual-Path Condition Alignment for Diffusion Transformers
Changhao Peng ⋅ Yuqi Ye ⋅ Shuangjun Du ⋅ Wenxu Gao ⋅ Wei Gao
Denoising-based generative models have been significantly advanced by representation-alignment (REPA) loss, which leverages pre-trained visual encoders to guide intermediate network features. However, REPA's reliance on external visual encoders introduces two critical challenges: potential distribution mismatches between the encoder's training data and the generation target, and the high computational costs of pre-training. Inspired by the observation that REPA primarily aids early layers in capturing robust semantics, we propose an unsupervised alternative that avoids an external visual encoder and the assumption of consistent data distribution. We introduce DUal-Path condition Alignment (DUPA), a novel self-alignment framework, which independently noises an image multiple times and processes these noisy latents through a decoupled diffusion transformer, then aligns the derived conditions (low-frequency semantic features extracted from each path). Experiments demonstrate that DUPA achieves an FID of 1.46 on ImageNet 256×256 with only 400 training epochs, outperforming all methods that do not rely on external supervision. DUPA is also model-agnostic and can be readily applied to any denoising-based generative model, showcasing its excellent scalability and generalizability. Code is available at https://github.com/PCH-gg/DUPA and https://openi.pcl.ac.cn/OpenAIDriving/DUPA.
Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Austin Xu ⋅ Xuan-Phi Nguyen ⋅ Yilun Zhou ⋅ Chien-Sheng Wu ⋅ Caiming Xiong ⋅ Shafiq Joty
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE
Geng Zhang ⋅ Han Yuxuan ⋅ Yuxuan Lou ⋅ Yiqi Zhang ⋅ Wangbo Zhao ⋅ Yang You
Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead due to the need to retain all experts in memory. While structured pruning is promising to reduce memory costs, existing methods often show suboptimal performance and unstable degradation in three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy based on two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices (unbiased estimations of their original outputs), minimizing performance degradation. Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across the three dimensions, confirming its effectiveness and robustness. Notably, it outperforms baselines by up to 2.72 in average zero-shot accuracy across nine downstream tasks under a 25% pruning ratio, with only a 0.14 performance drop for Qwen2-57B-A14B. The code is available at https://github.com/zxgx/mode-pd.
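A rough sketch of the pruning rule described in this abstract follows. The abstract does not say how access frequency and output variance are combined, so the product used as a redundancy score here is an assumption, as are the function names, the toy calibration data, and the choice of a constant mean-output vector as the "novice".

```python
from statistics import fmean, pvariance

def expert_stats(outputs, total_tokens):
    """outputs: output vectors one expert produced on the calibration tokens routed to it."""
    freq = len(outputs) / total_tokens               # access frequency
    dims = list(zip(*outputs))                       # values per output dimension
    mean = [fmean(d) for d in dims]                  # candidate novice: mean output
    var = fmean([pvariance(d) for d in dims])        # output variance, averaged over dims
    return freq, var, mean

def build_novices(calib, total_tokens, n_prune):
    """Pick the n_prune most redundant experts and return their constant replacements."""
    stats = {name: expert_stats(outs, total_tokens) for name, outs in calib.items()}
    # Assumed score: frequency * variance, so rarely used and stable experts rank first.
    ranked = sorted(stats, key=lambda name: stats[name][0] * stats[name][1])
    return {name: stats[name][2] for name in ranked[:n_prune]}

# Hypothetical calibration run: 10 tokens, top-1 routing, 2-D expert outputs.
calib = {
    "expert_0": [[0.9, 2.1], [1.1, 1.9]],
    "expert_1": [[0.0, 4.0], [2.0, 0.0], [4.0, 2.0], [1.0, 3.0], [3.0, 1.0]],
    "expert_2": [[0.0, 0.0], [3.0, 3.0], [6.0, 6.0]],
}
novices = build_novices(calib, total_tokens=10, n_prune=1)
print(novices)
```

Here `expert_0` is both the least used and the most stable expert, so it is the one replaced; at inference its slot would return the stored mean vector instead of running the expert's weights.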
FlowBind: Efficient Any-to-Any Generation with Bidirectional Flows
Yeonwoo Cha ⋅ Semin Kim ⋅ Jinhyeon Kwon ⋅ Seunghoon Hong
Any-to-any generation seeks to translate between arbitrary subsets of modalities, enabling flexible cross-modal synthesis. Despite recent success, existing flow-based approaches are challenged by their inefficiency, as they require large-scale datasets often with restrictive pairing constraints, incur high computational cost from modeling joint distribution, and rely on complex multi-stage training. We propose FlowBind, an efficient framework for any-to-any generation. Our approach is distinguished by its simplicity: it learns a shared latent space capturing cross-modal information, with modality-specific invertible flows bridging this latent to each modality. Both components are optimized jointly under a single flow-matching objective, and at inference the invertible flows act as encoders and decoders for direct translation across modalities. By factorizing interactions through the shared latent, FlowBind naturally leverages arbitrary subsets of modalities for training, and achieves competitive generation quality while substantially reducing data requirements and computational cost. Experiments on text, image, and audio demonstrate that FlowBind attains comparable quality while requiring up to 6× fewer parameters and training 10× faster than prior methods. The project page with code is available at https://yeonwoo378.github.io/official_flowbind.
Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks
Chunyang Jiang ⋅ Yonggang Zhang ⋅ Yiyang Cai ⋅ Chi-Min Chan ⋅ Yulong Liu ⋅ Mingming Chen ⋅ Wei Xue ⋅ Yike Guo
The rising cost of acquiring supervised data has driven significant interest in self-improvement for large language models (LLMs). Straightforward unsupervised signals like majority voting have proven effective in generating pseudo-labels for verifiable tasks, while their applicability to unverifiable tasks (e.g., translation) is limited by the open-ended character of responses. As a result, self-evaluation mechanisms (e.g., self-judging and entropy minimization) are predominantly used to derive pseudo-labels. However, self-evaluation relying on LLMs typically incurs high computational overhead and introduces overconfidence issues due to intrinsic biases. To address these challenges, we propose a novel self-evaluation-free approach for unverifiable tasks, designed for lightweight yet effective self-improvement. Inspired by majority voting commonly employed in verifiable tasks, we propose semantic voting as a novel mechanism that relaxes the principle of hard matching (i.e., exact matching) toward soft matching (i.e., semantic similarity). Soft matching is achieved by leveraging a lightweight sentence embedding model to quantify semantic similarity, thereby mitigating excessive computational burden and intrinsic bias-associated limitations of self-evaluation. Comprehensive experiments demonstrate that our method achieves substantial gains in computational efficiency and overall better performance than self-evaluation methods across diverse model architectures and tasks.
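The soft-matching vote described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: each candidate response is scored by its mean cosine similarity to the other candidates and the top scorer is taken as the pseudo-label. The 3-D vectors stand in for the output of a sentence embedding model and are made up, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_vote(embeddings):
    """Return (winner index, scores), scoring each candidate by mean similarity to the rest."""
    n = len(embeddings)
    scores = []
    for i in range(n):
        sims = [cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    winner = max(range(n), key=lambda i: scores[i])
    return winner, scores

candidates = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "Stock prices fell sharply.",
]
# Toy embeddings (assumed); a real pipeline would compute these with an embedding model.
toy_embeddings = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]

winner, scores = semantic_vote(toy_embeddings)
print(candidates[winner])
```

Exact-match majority voting is the special case where similarity is 1 for identical answers and 0 otherwise; relaxing it to a continuous similarity is what lets the vote apply to open-ended outputs such as translations.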
Delay Flow Matching
Bolin Zhao ⋅ Xiaoyu Zhang ⋅ Yuting Dong ⋅ Xin Lu ⋅ Wei Lin ⋅ Qunxi Zhu
Flow matching (FM) based on Ordinary Differential Equations (ODEs) has achieved significant success in generative tasks. However, it faces several inherent limitations, including an inability to model trajectory intersections, capture delay dynamics, and handle transfer between heterogeneous distributions. These limitations often result in a significant mismatch between the modeled transfer process and real-world phenomena, particularly when key coupling or inherent structural information between distributions must be preserved. To address these issues, we propose Delay Flow Matching (DFM), a new FM framework based on Delay Differential Equations (DDEs). Theoretically, we show that DFM possesses universal approximation capability for continuous transfer maps. By incorporating delay terms into the vector field, DFM enables trajectory intersections and better captures delay dynamics. Moreover, by designing appropriate initial functions, DFM ensures accurate transfer between heterogeneous distributions. Consequently, our framework preserves essential coupling relationships and achieves more flexible distribution transfer strategies. We validate DFM's effectiveness across synthetic datasets, single-cell data, and image-generation tasks.
SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion
Sedjro Salomon Hotegni ⋅ Sebastian Peitz
Developing efficient multi-objective optimization methods to compute the Pareto set of optimal compromises between conflicting objectives remains a key challenge, especially for large-scale and expensive problems. To bridge this gap, we introduce SPREAD, a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs). SPREAD first learns a conditional diffusion process over points sampled from the decision space and then, at each reverse diffusion step, refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence alongside a Gaussian RBF–based repulsion term for diversity. Empirical results on multi-objective optimization benchmarks, including offline and Bayesian surrogate-based settings, show that SPREAD matches or exceeds leading baselines in efficiency, scalability, and Pareto front coverage. Code is available at https://github.com/safe-autonomous-systems/moo-spread .
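As a rough illustration of the diversity term only, the sketch below takes one gradient step on a pairwise Gaussian RBF energy so that nearby candidates are pushed apart. The function names, bandwidth `sigma`, and step size are assumptions for illustration; the actual SPREAD update also includes the multiple-gradient-descent-inspired term and runs inside the reverse diffusion process, neither of which is shown here.

```python
import math

def rbf_repulsion_step(points, sigma=1.0, step=0.1):
    """One descent step on E = sum_{i<j} exp(-||x_i - x_j||^2 / (2 sigma^2)).

    Lowering E moves points away from each other.
    """
    updated = []
    for i, xi in enumerate(points):
        push = [0.0] * len(xi)
        for j, xj in enumerate(points):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, xj)]
            k = math.exp(-sum(c * c for c in diff) / (2 * sigma ** 2))
            # Negative gradient of E w.r.t. x_i: k * (x_i - x_j) / sigma^2
            for axis in range(len(xi)):
                push[axis] += k * diff[axis] / sigma ** 2
        updated.append([a + step * p for a, p in zip(xi, push)])
    return updated

def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two points."""
    return min(
        math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]
    )

# Hypothetical, tightly clustered candidates in a 2D decision space.
candidates = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
spread_out = rbf_repulsion_step(candidates)
```

After one step the closest pair of candidates is farther apart than before, which is the effect a repulsion term is meant to have on Pareto front coverage.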
Quotient-Space Diffusion Models
Yixian Xu ⋅ Yusong Wang ⋅ Shengjie Luo ⋅ Kaiyuan Gao ⋅ Tianyu He ⋅ Di He ⋅ Chang Liu
Diffusion-based generative models have transformed generative AI and have enabled new capabilities in the science domain, for example, generating 3D structures of molecules. Due to the intrinsic problem structure of certain tasks, there is often a symmetry in the system, which identifies objects that can be converted by a group action as equivalent, hence the target distribution is essentially defined on the quotient space with respect to the group. In this work, we establish a formal framework for diffusion modeling on a general quotient space, and apply it to molecular structure generation which follows the special Euclidean group SE(3) symmetry. The framework reduces the need to learn the component corresponding to the group action, and hence reduces learning difficulty relative to conventional group-equivariant diffusion models, and the sampler guarantees recovering the target distribution, while heuristic alignment strategies lack proper samplers. The arguments are empirically validated on structure generation for small molecules and proteins, indicating that the principled quotient-space diffusion model provides a new framework that outperforms previous symmetry treatments.
Open Data Synthesis for Deep Research
Ziyi Xia ⋅ Kun Luo ⋅ Hongjin Qian ⋅ Siqi Bao ⋅ Zheng Liu
Deep research becomes increasingly important as people seek to solve complex problems that require gathering and synthesizing information from diverse sources. A key capability in this process is agentic search, where an LLM-agent iteratively retrieves relevant information across multiple sources while performing multi-step reasoning. However, developing effective agentic search systems is challenging due to the lack of high-quality training data that reflects the complexity of real-world research tasks. To address this gap, we introduce InfoSeek, a novel data synthesis framework that conceptualizes agentic search as a Hierarchical Constraint Satisfaction Problem (HCSP), where solving a task requires satisfying layered constraints across multiple levels of sub-problems. InfoSeek employs a Diffusion–Retrospection process: in the diffusion phase, the framework expands outward from a seed webpage, generating constraints that connect to neighboring pages and forming an exploration tree; in the retrospection phase, a subtree is sampled and backtracking constraints are introduced, which are then blurred and integrated into an HCSP instance. As a generic framework, InfoSeek can be easily extended to other domains beyond web, facilitating ad-hoc optimization of deep research. To our knowledge, InfoSeek is the first publicly released framework in this area, complete with open-source code and well-curated datasets. Extensive experiments on diverse information-seeking benchmarks show that training on InfoSeek-generated data substantially improves agentic search performance, delivering significantly larger gains than traditional datasets across diverse model backends and training strategies, thereby validating the effectiveness of our approach.
pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation
Hansheng Chen ⋅ Kai Zhang ⋅ Hao Tan ⋅ Leonidas Guibas ⋅ Gordon Wetzstein ⋅ Sai Bi
Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality--diversity trade-off. To address this, we propose policy-based flow models ($\pi$-Flow). $\pi$-Flow modifies the output layer of a student flow model to predict a network-free policy at one timestep. The policy then produces dynamic flow velocities at future substeps with negligible overhead, enabling fast and accurate ODE integration without extra network evaluations. To match the policy's ODE trajectory to the teacher's, we introduce a novel imitation distillation approach, which matches the policy's velocity to the teacher's along the policy's trajectory using a standard $\ell_2$ flow matching loss. By simply mimicking the teacher's behavior, $\pi$-Flow enables stable and scalable training and avoids the quality--diversity trade-off. On ImageNet $256\times 256$, it attains a 1-NFE FID of 2.85, outperforming previous 1-NFE models of the same DiT architecture. On FLUX.1-12B and Qwen-Image-20B at 4 NFEs, $\pi$-Flow achieves substantially better diversity than state-of-the-art DMD models, while maintaining teacher-level quality.
Generalised Flow Maps for Few-Step Generative Modelling on Riemannian Manifolds
Oscar Davis ⋅ Michael Albergo ⋅ Nicholas Boffi ⋅ Michael Bronstein ⋅ Joey Bose
Geometric data and purpose-built generative models on them have become ubiquitous in high-impact deep learning application domains, ranging from protein backbone generation and computational chemistry to geospatial data. Current geometric generative models remain computationally expensive at inference---requiring many steps of complex numerical simulation---as they are derived from dynamical measure transport frameworks such as diffusion and flow-matching on Riemannian manifolds. In this paper, we propose Generalised Flow Maps (GFM), a new class of few-step generative models that generalises the Flow Map framework in Euclidean spaces to arbitrary Riemannian manifolds. We instantiate GFMs with three self-distillation-based training methods: Generalised Lagrangian Flow Maps, Generalised Eulerian Flow Maps, and Generalised Progressive Flow Maps. We theoretically show that GFMs, under specific design decisions, unify and elevate existing Euclidean few-step generative models, such as consistency models, shortcut models, and meanflows, to the Riemannian setting. We benchmark GFMs against other geometric generative models on a suite of geometric datasets, including geospatial data, RNA torsion angles, and hyperbolic manifolds, and achieve state-of-the-art sample quality for single- and few-step evaluations, and superior or competitive log-likelihoods using the implicit probability flow.
Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
Ming Gui ⋅ Johannes Schusterbauer ⋅ Timy Phan ⋅ Felix Krause ⋅ Joshua Susskind ⋅ MIGUEL ANGEL BAUTISTA ⋅ Björn Ommer
We introduce Representation Tokenizer (RepTok), a generative modeling framework that represents an image using a single continuous latent token obtained from self-supervised vision transformers. Building on a pre-trained SSL encoder, we fine-tune only the semantic token embedding and pair it with a generative decoder trained end-to-end using a standard flow matching objective. This adaptation enriches the token with low-level, reconstruction-relevant details, enabling faithful image reconstruction. To preserve the favorable geometry of the original SSL space, we add a cosine-similarity loss that regularizes the adapted token, ensuring it remains smooth and suitable for generation. Our single-token formulation resolves the spatial redundancies of the 2D latent space, simplifies architectures, and significantly reduces training costs. Despite its simplicity and efficiency, RepTok achieves competitive results on class-conditional ImageNet generation and extends naturally to text-to-image synthesis, reaching competitive zero-shot performance on MS-COCO under extremely limited training budgets. Our findings highlight the potential of fine-tuned SSL representations as compact and effective latent spaces for efficient generative modeling. We will release our model to facilitate further research.
Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data
Rishabh Ranjan ⋅ Valter Hudovernik ⋅ Mark Znidar ⋅ Charilaos Kanatsoulis ⋅ Roshan Upendra ⋅ Mahmoud Mohammadi ⋅ Joe Meyer ⋅ Tom Palczewski ⋅ Carlos Guestrin ⋅ Jure Leskovec
Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) incorporates task specification via task table prompting, (ii) tokenizes cells with table/column metadata, (iii) is pretrained via masked token prediction, and (iv) utilizes a novel Relational Attention mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 93% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experimental analyses show that RT's zero-shot transfer leverages task context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data. Code, models, data: https://github.com/snap-stanford/relational-transformer.
Graph homophily booster: Reimagining the role of discrete features in heterophilic graph learning
Ruizhong Qiu ⋅ Ting-Wei Li ⋅ Gaotang Li ⋅ Hanghang Tong
Graph neural networks (GNNs) have emerged as a powerful tool for modeling graph-structured data, demonstrating remarkable success in many real-world applications such as complex biological network analysis, neuroscientific analysis, and social network analysis. However, existing GNNs often struggle with heterophilic graphs, where connected nodes tend to have dissimilar features or labels. While numerous methods have been proposed to address this challenge, they primarily focus on architectural designs without directly targeting the root cause of the heterophily problem. These approaches still perform even worse than the simplest MLPs on challenging heterophilic datasets. For instance, our experiments show that 23 recent GNNs still fall behind the MLP on the Actor dataset. This critical challenge calls for an innovative approach to addressing graph heterophily beyond architectural designs. To bridge this gap, we propose and study a new and unexplored paradigm: directly increasing the graph homophily via a carefully designed graph transformation. In this work, we present a simple yet effective framework called Graph Homophily Booster (GRAPHITE) to address graph heterophily. To the best of our knowledge, this work is the first method that explicitly transforms the graph to directly improve the graph homophily. Stemming from the exact definition of homophily, our proposed GRAPHITE creates feature nodes to facilitate homophilic message passing between nodes that share similar features. Furthermore, we show both theoretically and empirically that our proposed GRAPHITE significantly increases the homophily of originally heterophilic graphs, with only a slight increase in the graph size. Extensive experiments on challenging datasets demonstrate that our proposed GRAPHITE significantly outperforms state-of-the-art methods on heterophilic graphs while achieving comparable accuracy with state-of-the-art methods on homophilic graphs.
Furthermore, our proposed graph transformation alone can already enhance the performance of homophilic GNNs on heterophilic graphs, even though they were not originally designed for heterophilic graphs. Our code is publicly available at https://github.com/q-rz/ICLR26-GRAPHITE .
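The feature-node idea can be sketched as a simple graph transformation, assuming each node carries a set of discrete feature ids. The function name `add_feature_nodes` and the data layout are hypothetical, and the published GRAPHITE construction may differ in detail (for example in how edges are directed or weighted); see the linked repository for the authors' code.

```python
def add_feature_nodes(num_nodes, edges, features):
    """Add one new node per discrete feature and link it to every node holding it.

    num_nodes: number of original nodes, labeled 0..num_nodes-1
    edges:     list of (u, v) pairs between original nodes
    features:  features[i] is the set of discrete feature ids of node i
    Returns (new_num_nodes, new_edges, feature_to_node_id).
    """
    feature_ids = sorted({f for fs in features for f in fs})
    feat_node = {f: num_nodes + k for k, f in enumerate(feature_ids)}
    new_edges = list(edges)
    for node, fs in enumerate(features):
        for f in sorted(fs):
            new_edges.append((node, feat_node[f]))
    return num_nodes + len(feature_ids), new_edges, feat_node

# Hypothetical heterophilic path 0-1-2-3 whose neighbors never share a feature.
edges = [(0, 1), (1, 2), (2, 3)]
features = [{"a"}, {"b"}, {"a"}, {"b"}]
total, new_edges, feat_node = add_feature_nodes(4, edges, features)
```

In this toy graph, nodes 0 and 2 share feature "a" but are not adjacent; after the transformation they are two hops apart through the new feature node, at the cost of two extra nodes and four extra edges.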
Glance for Context: Learning When to Leverage LLMs for Node-Aware GNN-LLM Fusion
Donald Loveland ⋅ Yao-An Yang ⋅ Danai Koutra
Learning on text-attributed graphs has motivated the use of Large Language Models (LLMs) for graph learning. However, most fusion strategies are applied uniformly across all nodes and attain only small overall performance gains. We argue this result stems from aggregate metrics that obscure when LLMs provide benefit, inhibiting actionable signals for new strategies. In this work, we reframe LLM–GNN fusion around nodes where GNNs typically falter. We first show that performance can significantly differ between GNNs and LLMs, with each excelling on distinct structural patterns, such as local homophily. To leverage this finding, we propose GLANCE (GNN with LLM Assistance for Neighbor- and Context-aware Embeddings), a framework that invokes an LLM to refine a GNN's prediction. GLANCE employs a lightweight router that, given inexpensive per-node signals, decides whether to query the LLM. Since the LLM calls are non-differentiable, the router is trained with an advantage-based objective that compares the utility of querying the LLM against relying solely on the GNN. Across multiple benchmarks, GLANCE achieves the best performance balance across node subgroups, achieving significant gains on heterophilous nodes (up to +13\%) while simultaneously achieving top overall performance. Our findings highlight the value of adaptive, node-aware GNN-LLM architectures, where selectively invoking the LLM enables scalable deployment on large graphs without incurring high computational costs.
Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts
Kartik Sharma ⋅ Yiqiao Jin ⋅ Vineeth Rakesh ⋅ Yingtong Dou ⋅ Menghai Pan ⋅ Mahashweta Das ⋅ Srijan Kumar
As large language models (LLMs) are deployed in safety-critical settings, it is essential to ensure that their responses comply with safety standards. Prior research has revealed that LLMs often fail to grasp the notion of safe behaviors, resulting in either unjustified refusals to harmless prompts or the generation of harmful content. While substantial efforts have been made to improve their robustness, existing defenses often rely on costly fine-tuning of model parameters or employ suboptimal heuristic techniques. In this work, we take a novel approach to safeguard LLMs by learning to adapt the system prompts in instruction-tuned LLMs. While LLMs are typically pre-trained to follow a fixed system prompt, we investigate the impact of tailoring the system prompt to each specific user input on the safety of the responses. To this end, we propose Sysformer, a transformer model that updates an initial system prompt to a more robust system prompt in the LLM input embedding space while attending to the user prompt. While keeping the LLM parameters frozen, the Sysformer is trained to refuse to respond to a set of harmful prompts while responding ideally to a set of safe ones. Through extensive experiments on 5 LLMs from different families and 2 recent benchmarks, we demonstrate that Sysformer can significantly enhance the robustness of LLMs, leading to a gain of up to 80% in the refusal rate on harmful prompts while enhancing compliance with safe prompts by up to 90%. Results also generalize well to sophisticated jailbreaking attacks, making LLMs up to 100% more robust against different attack strategies. We hope our findings lead to cheaper safeguarding of LLMs and motivate future investigations into designing variable system prompts.
Foresight Diffusion: Improving Sampling Consistency in Predictive Diffusion Models
Yu Zhang ⋅ Xingzhuo Guo ⋅ Haoran Xu ⋅ Jialong Wu ⋅ Mingsheng Long
Diffusion and flow-based models have enabled significant progress in generation tasks across various modalities and have recently found applications in predictive learning. However, unlike typical generation tasks that encourage sample diversity, predictive learning entails different sources of stochasticity and requires sampling consistency aligned with the ground-truth trajectory, which is a limitation we empirically observe in diffusion models. We argue that a key bottleneck in learning sampling-consistent predictive diffusion models lies in suboptimal predictive ability, which we attribute to the entanglement of condition understanding and target denoising within shared architectures and co-training schemes. To address this, we propose Foresight Diffusion (ForeDiff), a framework for predictive diffusion models that improves sampling consistency by decoupling condition understanding from target denoising. ForeDiff incorporates a separate deterministic predictive stream to process conditioning inputs independently of the denoising stream, and further leverages a pretrained predictor to extract informative representations that guide generation. Extensive experiments on robot video prediction and scientific spatiotemporal forecasting show that ForeDiff improves both predictive accuracy and sampling consistency over strong baselines, offering a promising direction for predictive diffusion models.
Efficient Learning on Large Graphs using a Densifying Regularity Lemma
Jonathan Kouchly ⋅ Ben Finkelshtein ⋅ Michael Bronstein ⋅ Ron Levie
Learning on large graphs presents significant challenges, with traditional Message Passing Neural Networks suffering from computational and memory costs scaling linearly with the number of edges. We introduce the Intersecting Block Graph (IBG), a low-rank factorization of large directed graphs based on combinations of intersecting bipartite components, each consisting of a pair of communities, for source and target nodes. By giving less weight to non-edges, we show how an IBG can efficiently approximate any graph, sparse or dense. Specifically, we prove a constructive version of the weak regularity lemma: for any chosen accuracy, every graph can be approximated by a dense IBG whose rank depends only on that accuracy. This improves over prior versions of the lemma, where the rank depended on the number of nodes for sparse graphs. Our method allows for efficient approximation of large graphs that are both directed and sparse, a crucial capability for many real-world applications. We then introduce a graph neural network architecture operating on the IBG representation of the graph and demonstrating competitive performance on node classification, spatio-temporal graph analysis, and knowledge graph completion, while having memory and computational complexity linear in the number of nodes rather than edges.
On the Expressive Power of GNNs for Boolean Satisfiability
Saku Peltonen ⋅ Roger Wattenhofer
Machine learning approaches to solving Boolean Satisfiability (SAT) aim to replace handcrafted heuristics with learning-based models. Graph Neural Networks have emerged as the main architecture for SAT solving, due to the natural graph representation of Boolean formulas. We analyze the expressive power of GNNs for SAT solving through the lens of the Weisfeiler-Leman (WL) test. As our main result, we prove that the full WL hierarchy cannot, in general, distinguish between satisfiable and unsatisfiable instances. We show that indistinguishability under higher-order WL carries over to practical limitations for WL-bounded solvers that set variables sequentially. We further study the expressivity required for several important families of SAT instances, including regular, random and planar instances. To quantify expressivity needs in practice, we conduct experiments on random instances from the G4SAT benchmark and industrial instances from the 2024 SAT competition. Our results suggest that while random instances are largely distinguishable, industrial instances often require more expressivity to predict a satisfying assignment.
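For readers unfamiliar with the WL test, the following sketch implements plain 1-WL color refinement on ordinary undirected graphs and shows the classic limitation that it cannot separate two 2-regular graphs on the same number of nodes. It illustrates the test only; it does not encode SAT formulas as graphs and does not reproduce the paper's constructions. All function names are hypothetical.

```python
from collections import Counter

def wl_indistinguishable(adj_a, adj_b):
    """Return True if 1-WL color refinement cannot tell the two graphs apart.

    Both graphs are refined jointly (as a disjoint union) so that color
    labels are comparable across them.
    """
    union = {("a", v): [("a", u) for u in nbrs] for v, nbrs in adj_a.items()}
    union.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in adj_b.items()})
    colors = {v: 0 for v in union}
    for _ in range(len(union)):
        # A node's signature is its own color plus the multiset of neighbor colors.
        sigs = {
            v: (colors[v], tuple(sorted(colors[u] for u in union[v])))
            for v in union
        }
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        refined = {v: palette[sigs[v]] for v in union}
        if len(set(refined.values())) == len(set(colors.values())):
            break  # the partition is stable
        colors = refined
    hist_a = Counter(c for (tag, _), c in colors.items() if tag == "a")
    hist_b = Counter(c for (tag, _), c in colors.items() if tag == "b")
    return hist_a == hist_b

def cycle(nodes):
    """Adjacency lists of a cycle through the given nodes."""
    n = len(nodes)
    return {nodes[i]: [nodes[(i - 1) % n], nodes[(i + 1) % n]] for i in range(n)}

six_cycle = cycle([0, 1, 2, 3, 4, 5])
two_triangles = {**cycle([0, 1, 2]), **cycle([3, 4, 5])}
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

same_for_wl = wl_indistinguishable(six_cycle, two_triangles)
path_differs = not wl_indistinguishable(six_cycle, path)
```

The six-cycle and the pair of triangles are not isomorphic, yet every node in both has the same refined color, so 1-WL treats them as identical; the path is separated immediately because its endpoints have degree one.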
Learning from Algorithm Feedback: One-Shot SAT Solver Guidance with GNNs
Jan Tönshoff ⋅ Martin Grohe
Boolean Satisfiability (SAT) solvers are foundational to computer science, yet their performance typically hinges on hand-crafted heuristics. This work introduces Reinforcement Learning from Algorithm Feedback (RLAF) as a paradigm for learning to guide SAT solver branching heuristics with Graph Neural Networks (GNNs). Central to our approach is a novel and generic mechanism for injecting inferred variable weights and polarities into the branching heuristics of existing SAT solvers. In a single forward pass, a GNN assigns these parameters to all variables. Casting this one-shot guidance as a reinforcement learning problem lets us train the GNN with off-the-shelf policy-gradient methods, such as GRPO, directly using the solver's computational cost as the sole reward signal. Extensive evaluations demonstrate that RLAF-trained policies significantly reduce the mean solve times of different base solvers across diverse SAT problem distributions, achieving more than a 2x speedup in some cases, while generalizing effectively to larger and harder problems after training. Notably, these policies consistently outperform expert-supervised approaches based on learning handcrafted weighting heuristics, offering a promising path towards data-driven heuristic design in combinatorial optimization.
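To make the reward signal concrete, here is a minimal sketch of a GRPO-style group-relative advantage computed from solver cost. It assumes several guidance samples are drawn for the same SAT instance and that the reward is simply the negated cost; the cost values and the function name are hypothetical, and the exact reward shaping used in RLAF is not specified in the abstract.

```python
from statistics import mean, pstdev

def group_relative_advantages(solver_costs):
    """Normalize rewards (negated solver costs) within one group of samples.

    Samples that made the solver cheaper than the group average receive a
    positive advantage; more expensive ones receive a negative advantage.
    """
    rewards = [-c for c in solver_costs]
    mu = mean(rewards)
    sd = pstdev(rewards)
    if sd == 0:
        # All samples performed identically, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sd for r in rewards]

# Hypothetical solver costs (e.g., decision counts) for four sampled guidances.
costs = [1200, 800, 1500, 500]
advantages = group_relative_advantages(costs)
```

Because advantages are computed relative to the group, no learned value function or expert labels are needed; the solver's own cost is the only supervision.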
HGNet: Scalable Foundation Model for Automated Knowledge Graph Generation from Scientific Literature
Devvrat Joshi ⋅ Islem Rekik
Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches face persistent challenges: they struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical and logically constrained nature of scientific knowledge. While general-purpose large language models (LLMs) offer some adaptability, they are computationally expensive and yield inconsistent accuracy on specialized, domain-heavy tasks such as scientific knowledge graph construction. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic “turns” in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram–aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. To the best of our knowledge, this is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler and more interpretable alternative to hyperbolic methods. To address data scarcity, we also release SPHERE, a large-scale, multi-domain benchmark for hierarchical relation extraction. 
Our framework establishes a new state of the art on benchmarks such as SciERC, SciER, and SPHERE, improving named entity recognition (NER) by 8.08\% and relation extraction (RE) by 5.99\% on the official out-of-distribution test sets. In zero-shot settings, the gains are even more pronounced, with improvements of 10.76\% for NER and 26.2\% for RE, marking a significant step toward reliable and scalable scientific knowledge graph construction.
Adversarial Robustness of Graph Transformers
Philipp Foth ⋅ Lukas Gosch ⋅ Simon Geisler ⋅ Leo Schwinn ⋅ Stephan Günnemann
Existing studies have shown that Message-Passing Graph Neural Networks (MPNNs) are highly susceptible to adversarial attacks. In contrast, despite the increasing importance of Graph Transformers (GTs), their robustness properties are unexplored. We close this gap and design the first adaptive attacks for GTs. In particular, we provide general design principles for strong gradient-based attacks on GTs w.r.t. structure perturbations and instantiate our attack framework for five representative and popular GT architectures. Specifically, we study GTs with specialized attention mechanisms and Positional Encodings (PEs) based on pairwise shortest paths, random walks, and the Laplacian spectrum. We evaluate our attacks on multiple tasks and perturbation models, including structure perturbations for node and graph classification, and node injection for graph classification. Our results reveal that GTs can be catastrophically fragile in many cases. Addressing this vulnerability, we show how our adaptive attacks can be effectively used for adversarial training, substantially improving robustness.
KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning
Yinyi Luo ⋅ Zhexian Zhou ⋅ Hao Chen ⋅ Kai Qiu ⋅ Marios Savvides ⋅ Sharon Li ⋅ Jindong Wang
Knowledge editing and machine unlearning are two popular approaches for keeping large language models (LLMs) up-to-date. However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation. For instance, are LLMs similar to humans in modifying certain knowledge? How do editing and unlearning differ as training data increases? This paper proposes KnowledgeSmith, a unified framework to systematically understand the updating mechanism of LLMs. We first cast editing and unlearning as instances of one constrained optimization problem. Then, we propose an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling controlled studies of how different modification strategies propagate through model knowledge. Extensive experiments yield nuanced insights into knowledge propagation, plasticity scaling, consistency, and robustness. For instance, our results show that LLMs do not update different levels of knowledge the way humans do, and that there exists a consistency-capacity trade-off. We hope our findings can inform the design of more reliable and scalable strategies. Code: https://github.com/AIFrontierLab/KnowledgeSmith
Sampling-aware Adversarial Attacks Against Large Language Models
Tim Beyer ⋅ Yan Scholten ⋅ Leo Schwinn ⋅ Stephan Günnemann
To guarantee safe and robust deployment of large language models (LLMs) at scale, it is critical to accurately assess their adversarial robustness. Existing adversarial attacks typically target harmful responses in single-point greedy generations, overlooking the inherently stochastic nature of LLMs and overestimating robustness. We show that for the goal of eliciting harmful responses, repeated sampling of model outputs during the attack complements prompt optimization and serves as a strong and efficient attack vector. By casting attacks as a resource allocation problem between optimization and sampling, we determine compute-optimal trade-offs and show that integrating sampling into existing attacks boosts success rates by up to 37\% and improves efficiency by up to two orders of magnitude. We further analyze how distributions of output harmfulness evolve during an adversarial attack, discovering that many common optimization strategies have little effect on output harmfulness. Finally, we introduce a label-free proof-of-concept objective based on entropy maximization, demonstrating how our sampling-aware perspective enables new optimization targets. Overall, our findings establish the importance of sampling in attacks to accurately assess and strengthen LLM safety at scale.
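The reason repeated sampling matters can be seen from a back-of-the-envelope calculation. The sketch below assumes samples are independent and share the same per-sample probability of a harmful response, which is a simplifying assumption and not a claim from the paper; the probability value is hypothetical.

```python
def success_within_n_samples(p_single, n_samples):
    """Probability that at least one of n independent samples is harmful,
    given each sample is harmful with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_samples

# Hypothetical per-sample success probability under a fixed adversarial prompt.
p = 0.05
greedy_like = success_within_n_samples(p, 1)
with_sampling = success_within_n_samples(p, 50)
```

Under these assumptions, a prompt that looks almost harmless under a single generation becomes likely to elicit a harmful response once the attacker draws many samples, which is why single-point evaluation can overestimate robustness.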
WRING Out The Bias: A Rotation-Based Alternative To Projection Debiasing
Walter Gerych ⋅ Cassandra Parent ⋅ Quinn Perian ⋅ Rafiya Javed ⋅ Justin Solomon ⋅ Marzyeh Ghassemi
Vision-Language models (VLMs), including CLIP, are known to encode biases such as learning spurious correlations that falsely associate background attributes with particular labels. Debiasing approaches typically aim to isolate and remove subspaces corresponding to a target concept via projecting the embedding away from the concept. This strategy succeeds in debiasing VLM embeddings with respect to the concepts considered but can amplify biased shortcuts in unconsidered concepts. In practice, it is impossible to enumerate all possible biases, meaning that an increase in bias can go unobserved during evaluation. We propose a debiasing approach for a set of known concepts such that the relation to the remaining, unconsidered, concepts is minimally changed. We achieve this by rotating the VLM's embeddings within only a relevant subspace, rather than removing these subspaces, which mitigates unintended bias amplification.
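The geometric difference between the two strategies can be sketched on a toy vector. The code assumes a unit concept direction `u` and a second orthonormal direction `v` spanning the subspace to rotate in; how the rotation and subspace are actually chosen is not described in the abstract, so the angle and vectors here are purely illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def project_out(x, c):
    """Projection debiasing: remove the component of x along unit vector c."""
    s = dot(x, c)
    return [xi - s * ci for xi, ci in zip(x, c)]

def rotate_in_plane(x, u, v, theta):
    """Rotate x by angle theta inside span{u, v} (u, v orthonormal).

    The part of x orthogonal to the plane is left unchanged.
    """
    a, b = dot(x, u), dot(x, v)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        xi + (cos_t - 1.0) * (a * ui + b * vi) + sin_t * (-b * ui + a * vi)
        for xi, ui, vi in zip(x, u, v)
    ]

# Hypothetical unit-norm embedding with a component along the concept axis u.
x = [0.6, 0.0, 0.8]
u = [1.0, 0.0, 0.0]
v = [0.0, 1.0, 0.0]

projected = project_out(x, u)
rotated = rotate_in_plane(x, u, v, math.pi / 2)
```

Both results have no component left along `u`, but the projection shrinks the embedding's norm while the rotation preserves it and leaves the coordinate outside the chosen plane untouched.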
Certified vs. Empirical Adversarial Robustness via Hybrid Convolutions with Attention Stochasticity
Joy Dhar ⋅ Song Xia ⋅ Manish Pandey ⋅ Maryam Haghighat ⋅ Azadeh Alavi ⋅ Ferdous Sohel ⋅ Wenyu Zhang ⋅ Nayyar Zaidi
We introduce Hybrid Convolutions with Attention Stochasticity (HyCAS), an adversarial defense that narrows the long-standing gap between provable robustness under ℓ2 certificates and empirical robustness against strong ℓ∞ attacks, while preserving strong generalization across diverse imaging benchmarks. HyCAS unifies deterministic and randomized principles by coupling 1-Lipschitz, spectrally normalized convolutions with two stochastic components (spectrally normalized random-projection filters and a randomized attention-noise mechanism) to realize a randomized defense. Injecting smoothing randomness inside the architecture yields an overall ≤ 2-Lipschitz network with formal certificates. Extensive experiments on diverse imaging benchmarks, including CIFAR-10/100, ImageNet-1k, NIH Chest X-ray, and HAM10000, show that HyCAS surpasses prior leading certified and empirical defenses, boosting certified accuracy by up to ≈ 7.3% (on NIH Chest X-ray) and empirical robustness by up to ≈ 3.1% (on HAM10000), without sacrificing clean accuracy. These results show that a randomized Lipschitz-constrained architecture can simultaneously improve both certified ℓ2 and empirical ℓ∞ adversarial robustness, thereby supporting safer deployment of deep models in high-stakes applications.
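Spectral normalization, the ingredient behind the 1-Lipschitz layers mentioned above, can be sketched for a plain dense matrix: estimate the largest singular value by power iteration and divide the weights by it. This is a generic illustration on a hypothetical 2x2 matrix; handling convolution operators and the stochastic components of HyCAS is outside its scope.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def l2(x):
    return math.sqrt(sum(a * a for a in x))

def spectral_norm(W, iters=100):
    """Estimate the largest singular value of W by power iteration on W^T W.

    A fixed start vector keeps the sketch deterministic; practical
    implementations typically start from a random vector.
    """
    Wt = transpose(W)
    v = [1.0] + [0.0] * (len(W[0]) - 1)
    for _ in range(iters):
        v = matvec(Wt, matvec(W, v))
        length = l2(v)
        v = [a / length for a in v]
    return l2(matvec(W, v))

# Hypothetical weight matrix with singular values 3 and 1.
W = [[2.0, 1.0], [1.0, 2.0]]
sigma = spectral_norm(W)
W_sn = [[w / sigma for w in row] for row in W]
```

After dividing by the estimated spectral norm, the linear map `W_sn` cannot stretch any input by more than a factor of one in the ℓ2 norm, which is what makes the layer 1-Lipschitz.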
Trapped by simplicity: When Transformers fail to learn from noisy features
Evan Peters ⋅ Matheus Zambianco ⋅ Ando Deng ⋅ Devin Blankespoor ⋅ Achim Kempf
Noise is ubiquitous in data used to train large language models, but it is not well understood whether these models are able to correctly generalize to inputs generated without noise. Here, we study noise-robust learning: are transformers trained on data with noisy features able to find a target function that correctly predicts labels for noiseless features? We show that transformers succeed at noise-robust learning for a selection of $k$-sparse parity and majority functions, whereas LSTMs fail at this task even for modest feature noise. However, we find that transformers typically fail at noise-robust learning of random $k$-juntas, especially when the boolean sensitivity of the optimal solution is smaller than that of the target function. We argue that this failure is due to two factors: transformers' bias toward simpler functions, and the observation that the empirically optimal function for noise-robust learning has lower sensitivity than the target function. We test this hypothesis by exploiting transformers' simplicity bias to trap them in an incorrect solution, but show that transformers can escape this trap by training with an additional loss term penalizing high-sensitivity solutions. Overall, we find that transformers are particularly ineffective for learning boolean functions in the presence of feature noise.
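To make the notions of $k$-sparse parity, majority, and boolean sensitivity concrete, here is a minimal brute-force sketch. It assumes that "sensitivity" means average sensitivity (the mean number of single-bit flips that change the output); the abstract does not state which variant the authors use, and the function names are illustrative.

```python
from itertools import product


def sparse_parity(x, support):
    """k-sparse parity: XOR of the bits of x at the indices in `support`."""
    return sum(x[i] for i in support) % 2


def majority(x, support):
    """Majority vote over the bits of x at the indices in `support`."""
    return int(2 * sum(x[i] for i in support) > len(support))


def average_sensitivity(f, n):
    """Mean over all x in {0,1}^n of the number of bit flips that change f(x).

    Exhaustive enumeration, so only practical for small n.
    """
    total = 0
    for x in product((0, 1), repeat=n):
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            total += f(x) != f(flipped)
    return total / 2 ** n


support = (0, 2, 4)  # 3 relevant bits out of n = 5
parity_sens = average_sensitivity(lambda x: sparse_parity(x, support), 5)
majority_sens = average_sensitivity(lambda x: majority(x, support), 5)
```

Under this definition a $k$-sparse parity has average sensitivity exactly $k$, while majority on the same three bits is less sensitive, which illustrates why a learner biased toward low-sensitivity functions may prefer one over the other.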
Why Adversarially Train Diffusion Models?
Maria Rosaria Briglia ⋅ Mujtaba Mirza ⋅ Giuseppe Lisanti ⋅ Iacopo Masi
Adversarial Training (AT) is a powerful, well-established technique for improving classifier robustness to input perturbations, yet its applicability beyond discriminative settings remains limited. Motivated by the widespread use of score-based generative models and their need to operate robustly under substantially noisy or corrupted input data, we propose an adaptation of AT for these models, providing a thorough empirical assessment. We introduce a principled formulation of AT for Diffusion Models (DMs) that replaces the conventional invariance objective with an equivariance constraint aligned to the denoising dynamics of score matching. Our method integrates seamlessly into diffusion training by adding either random perturbations (similar to randomized smoothing) or adversarial ones (akin to AT). Our approach offers several advantages: (a) tolerance to heavy noise and corruption, (b) reduced memorization, (c) robustness to outliers and extreme data variability, and (d) resilience to iterative adversarial attacks. We validate these claims on proof-of-concept low- and high-dimensional datasets with known ground-truth distributions, enabling precise error analysis. We further evaluate on standard benchmarks (CIFAR-10, CelebA, and LSUN Bedroom), where our approach shows improved robustness and preserved sample fidelity under severe noise, data corruption, and adversarial evaluation. Code available at github.com/OmnAI-Lab/Adversarial-Training-DM
TrainRef: Curating Data with Label Distribution and Minimal Reference for Accurate Prediction and Reliable Confidence
Murong Ma ⋅ Ruofan Liu ⋅ Yun Lin ⋅ Zhiyong Huang ⋅ Jin Song Dong
Practical classification requires both high predictive accuracy and reliable confidence for human-AI collaboration. Given that a high-quality dataset is expensive and sometimes impossible to obtain, learning with noisy labels (LNL) is of great importance. State-of-the-art works propose many denoising approaches that categorically correct label noise, i.e., change a label from one class to another. While effective in improving accuracy, they are less effective for learning reliable confidence, especially when the number of classes grows, giving rise to more ambiguous samples. In addition, traditional approaches usually curate the training dataset (e.g., by reweighting samples or correcting data labels) by intrinsically learning normalities from the noisy dataset. The curation performance can suffer when the noise ratio is high enough to form a polluting normality. In this work, we propose a training-time data-curation framework, TrainRef, to uniformly address predictive accuracy and confidence calibration by (1) using an extrinsic small set of reference samples $D_{\text{ref}}$ to avoid normality pollution and (2) curating labels into a class distribution instead of a categorical class to handle sample ambiguity. Our insight is that the extrinsic information allows us to select clean samples more precisely even when $|D_{\text{ref}}|$ equals the number of classes (i.e., one sample per class). Technically, we design (1) a reference augmentation technique to select clean samples from the dataset based on $D_{\text{ref}}$; and (2) a model-dataset co-evolving technique for a near-perfect embedding space, which is used to vote on the class distribution for the label of a noisy sample. Extensive experiments on CIFAR-100, Animal10N, and WebVision demonstrate that TrainRef outperforms state-of-the-art denoising techniques (DISC, L2B, and DivideMix) and model calibration techniques (label smoothing, Mixup, and temperature scaling). Furthermore, our user study shows that the confidence of models trained with TrainRef aligns well with human intuition. More demonstrations, proofs, and experimental details are available at https://sites.google.com/view/train-ref.
FERD: Fairness-Enhanced Data-Free Adversarial Robustness Distillation
Zhengxiao Li ⋅ Liming Lu ⋅ Xu Zheng ⋅ Si-Yuan Liang ⋅ Taric Chen ⋅ Yongbin Zhou ⋅ Shuchao Pang
Data-Free Robustness Distillation (DFRD) aims to transfer robustness from the teacher to the student without accessing the training data. While existing methods focus on overall robustness, they overlook robust fairness, leading to severe disparity in robustness across categories. In this paper, we identify two key problems: (1) a student model distilled with equal-class-proportion data behaves significantly differently across categories; and (2) the robustness of the student model is not stable across different attack targets. To bridge these gaps, we present the first Fairness-Enhanced data-free Robustness Distillation (FERD) framework, which adjusts the proportion and distribution of adversarial examples. For the proportion, FERD adopts a robustness-guided class reweighting strategy to synthesize more samples for the less robust categories, thereby improving their robustness. For the distribution, FERD generates complementary data samples for advanced robustness distillation. It generates Fairness-Aware Examples (FAEs) by enforcing a uniformity constraint on feature-level predictions, which suppresses the dominance of class-specific non-robust features and provides a more balanced representation across all categories. Then, FERD constructs Uniform-Target Adversarial Examples (UTAEs) from FAEs by applying a uniform target class constraint to avoid biased attack directions, which distributes the attack targets across all categories and prevents overfitting to specific vulnerable categories. Extensive experiments on three public datasets demonstrate that FERD achieves state-of-the-art worst-class robustness and NSD under all adversarial attacks. For instance, FERD improves worst-class robustness by up to 11.3% and reduces NSD by 0.077 compared to the best baseline on CIFAR-10 with MobileNet-V2. Our code is available at: https://github.com/mayaobuduyao/FERD.
AP-OOD: Attention Pooling for Out-of-Distribution Detection
Claus Hofmann ⋅ Christian Huber ⋅ Bernhard Lehner ⋅ Daniel Klotz ⋅ Sepp Hochreiter ⋅ Werner Zellinger
Out-of-distribution (OOD) detection, which maps high-dimensional data into a scalar OOD score, is critical for the reliable deployment of machine learning models. A key challenge in recent research is how to effectively leverage and aggregate token embeddings from language models to obtain the OOD score. In this work, we propose AP-OOD, a novel OOD detection method for natural language that goes beyond simple average-based aggregation by exploiting token-level information. AP-OOD is a semi-supervised approach that flexibly interpolates between unsupervised and supervised settings, enabling the use of limited auxiliary outlier data. Empirically, AP-OOD sets a new state of the art in OOD detection for text: in the unsupervised setting, it reduces the FPR95 (false positive rate at 95% true positives) from 27.77% to 5.91% on XSUM summarization, and from 75.19% to 68.13% on WMT15 En–Fr translation.
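For readers unfamiliar with the FPR95 metric quoted above, the following is a minimal sketch of how it can be computed from scalar OOD scores. It assumes that a higher score means "more OOD" and that OOD samples are the positive class; some papers use the opposite convention, and the abstract does not specify which one applies here.

```python
import math


def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """False positive rate at the threshold that detects at least `tpr` of OOD samples.

    Assumed convention: higher score = more OOD, OOD = positive class.
    """
    if not id_scores or not ood_scores:
        raise ValueError("need at least one ID and one OOD score")
    # Largest threshold that still flags at least `tpr` of the OOD samples.
    ranked = sorted(ood_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # number of OOD samples that must be caught
    threshold = ranked[k - 1]
    # False positives: in-distribution samples scoring at or above the threshold.
    false_pos = sum(1 for s in id_scores if s >= threshold)
    return false_pos / len(id_scores)


# Toy scores (illustrative values, not from the paper).
id_scores = [0.10, 0.20, 0.65, 0.30]
ood_scores = [0.90, 0.80, 0.70, 0.60]
fpr95 = fpr_at_tpr(id_scores, ood_scores)
```

Lower is better: a value of 0 means no in-distribution sample is flagged once the threshold is loose enough to catch 95% of the OOD samples.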
Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability
Shojiro Yamabe ⋅ Jun Sakuma
Diffusion language models (DLMs) generate tokens in parallel through iterative denoising, which can reduce latency and enable bidirectional conditioning. However, the safety risks posed by jailbreak attacks that exploit this inference mechanism are not well understood. In this paper, we reveal that DLMs have a critical vulnerability stemming from their iterative denoising process and propose a countermeasure. Specifically, our investigation identifies that if an affirmative token for a harmful query appears at an intermediate step, subsequent denoising can be steered toward a harmful response even in aligned models. Furthermore, we demonstrate that the vulnerability enables existing optimization-based jailbreak attacks to be applied to MDLMs. Building on this analysis, we propose a novel safety alignment method tailored to DLMs that trains models to generate safe responses from contaminated intermediate denoising steps containing affirmative tokens. Our experiments indicate that the proposed method significantly mitigates the vulnerability with minimal impact on task performance. Furthermore, our method also improves robustness against conventional jailbreak attacks. Our work underscores the need for DLM-specific safety research. Our code is available at https://github.com/mdl-lab/dlm-priming-vulnerability.
Are Deep Speech Denoising Models Robust to Adversarial Noise?
Will Schwarzer ⋅ Andrea Fanelli ⋅ Philip Thomas ⋅ Xiaoyu Liu
Deep noise suppression (DNS) models enjoy widespread use throughout a variety of high-stakes speech applications. However, we show that four recent DNS models can each be reduced to outputting unintelligible gibberish through the addition of psychoacoustically hidden adversarial noise, even in low-background-noise and simulated over-the-air settings. For three of the models, a small transcription study with audio and multimedia experts confirms unintelligibility of the attacked audio; simultaneously, an ABX study shows that the adversarial noise is generally imperceptible, with some variance between participants and samples. While we also establish several negative results around targeted attacks and model transfer, our results nevertheless highlight the need for practical countermeasures before open-source DNS systems can be used in safety-critical applications.
Variational Deep Learning via Implicit Regularization
Jonathan Wenger ⋅ Beau Coker ⋅ Juraj Marusic ⋅ John Cunningham
Modern deep learning models generalize remarkably well in-distribution, despite being overparametrized and trained with little to no explicit regularization. Instead, current theory credits implicit regularization imposed by the choice of architecture, hyperparameters, and optimization procedure. However, deep neural networks can be surprisingly non-robust, resulting in overconfident predictions and poor out-of-distribution generalization. Bayesian deep learning addresses this via model averaging, but typically requires significant computational resources as well as carefully elicited priors to avoid overriding the benefits of implicit regularization. Instead, in this work, we propose to regularize variational neural networks solely by relying on the implicit bias of (stochastic) gradient descent. We theoretically characterize this inductive bias in overparametrized linear models as generalized variational inference and demonstrate the importance of the choice of parametrization. Empirically, our approach demonstrates strong in- and out-of-distribution performance without additional hyperparameter tuning and with minimal computational overhead.
Concept-based Adversarial Attack: a Probabilistic Perspective
Andi Zhang ⋅ Xuan Ding ⋅ Steven McDonagh ⋅ Samuel Kaski
We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept - represented by a distribution - to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial examples and effectively preserve the underlying concept, while achieving higher attack efficiency.
Fine-Grained Iterative Adversarial Attacks with Limited Computation Budget
Zhichao Hou ⋅ Weizhi Gao ⋅ Xiaorui Liu
This work tackles a critical challenge in AI safety research under limited compute: given a fixed computation budget, how can one maximize the strength of iterative adversarial attacks? Coarsely reducing the number of attack iterations lowers cost but substantially weakens effectiveness. To maximize the attainable attack efficacy within a constrained budget, we propose a fine-grained control mechanism that selectively recomputes layer activations at both the iteration and layer levels. Extensive experiments show that our method consistently outperforms existing baselines at equal cost. Moreover, when integrated into adversarial training, it attains comparable performance with only 30% of the original budget.
Robustify Spiking Neural Networks via Dominant Singular Deflation under Heterogeneous Training Vulnerability
Desong Zhang ⋅ Jia Hu ⋅ Geyong Min
Spiking Neural Networks (SNNs) process information via discrete spikes, enabling them to operate at remarkably low energy levels. However, our experimental observations reveal a striking vulnerability when SNNs are trained using the mainstream method of direct encoding combined with backpropagation through time (BPTT): even a single backward pass on data drawn from a slightly different distribution can lead to catastrophic network collapse. We refer to this phenomenon as the heterogeneous training vulnerability of SNNs. Our theoretical analysis attributes this vulnerability to the repeated inputs inherent in direct encoding and the gradient accumulation characteristic of BPTT, which together produce an exceptionally large Hessian spectral radius. To address this challenge, we develop a hyperparameter-free method called Dominant Singular Deflation (DSD). By orthogonally projecting out the dominant singular components of gradients, DSD effectively reduces the Hessian spectral radius, thereby preventing SNNs from settling into sharp minima. Extensive experiments demonstrate that DSD not only mitigates the vulnerability of SNNs under heterogeneous training, but also significantly enhances overall robustness compared to key baselines, providing strong support for safer SNNs.
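One plausible reading of "deflating the dominant singular component" of a gradient matrix is sketched below: estimate the top singular triplet by power iteration and subtract the corresponding rank-one term. This is an assumption-laden illustration, not the authors' DSD algorithm; the function names, the per-matrix granularity, and the use of power iteration are all hypothetical choices.

```python
import math


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def top_singular_triplet(G, iters=200):
    """Estimate the dominant singular triplet (sigma, u, v) by power iteration on G^T G.

    Uses a fixed uniform start vector; it would fail to converge to the top
    direction only if that start vector were orthogonal to it.
    """
    Gt = transpose(G)
    v = [1.0 / math.sqrt(len(Gt))] * len(Gt)
    for _ in range(iters):
        w = matvec(Gt, matvec(G, v))
        w_norm = math.sqrt(sum(x * x for x in w))
        if w_norm == 0.0:  # G maps v to zero; nothing to deflate
            return 0.0, [0.0] * len(G), v
        v = [x / w_norm for x in w]
    Gv = matvec(G, v)
    sigma = math.sqrt(sum(x * x for x in Gv))
    u = [x / sigma for x in Gv] if sigma > 0.0 else [0.0] * len(G)
    return sigma, u, v


def deflate_dominant(G):
    """Return G with its dominant rank-one component sigma * u * v^T removed."""
    sigma, u, v = top_singular_triplet(G)
    return [[G[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]


# Toy "gradient" with singular values 3 and 1 (illustrative values only).
G = [[3.0, 0.0], [0.0, 1.0]]
sigma, _, _ = top_singular_triplet(G)
G_deflated = deflate_dominant(G)
```

After deflation the largest remaining singular value is the former second one, so a single sharp direction no longer dominates the update.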
Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Qiwei Tian ⋅ Chenhao Lin ⋅ Zhengyu Zhao ⋅ Chao Shen
To address the trade-off between robustness and performance in robust VLMs, we observe that function words can make VLMs vulnerable to cross-modal adversarial attacks, and accordingly propose Function-word De-Attention (FDA) to mitigate this vulnerability. Inspired by differential transformers, FDA computes both the original and the function-word cross-attention within attention heads, and differentially subtracts the latter from the former for more robust alignment. Comprehensive experiments cover 2 SOTA baselines under 6 different attacks on 2 downstream tasks, 3 datasets, and 3 models. Overall, FDA yields an average 18/13/53% ASR drop with only 0.2/0.3/0.6% performance drops on the 3 tested models on retrieval, and a 90% ASR drop with a 0.3% performance gain on visual grounding. We experimentally demonstrate the scalability, generalization, and zero-shot performance of FDA, and provide in-depth ablation studies and analysis. Code is available at https://github.com/michaeltian108/FDA.
Inverse Scaling in Test-Time Compute
Aryo Pradipta Gema ⋅ Alexander Hägele ⋅ Runjin Chen ⋅ Andy Arditi ⋅ Jacob Goldman-Wetzler ⋅ Kit Fraser-Taliente ⋅ Henry Sleight ⋅ Linda Petrini ⋅ Julian Michael ⋅ Beatrice Alex ⋅ Pasquale Minervini ⋅ Yanda Chen ⋅ Joe Benton ⋅ Ethan Perez
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.
Parameterized Hardness of Zonotope Containment and Neural Network Verification
Vincent Froese ⋅ Moritz Grillo ⋅ Christoph Hertrich ⋅ Moritz Stargalla
Neural networks with ReLU activations are a widely used model in machine learning. It is thus important to have a profound understanding of the properties of the functions computed by such networks. Recently, there has been increasing interest in the (parameterized) computational complexity of determining these properties. In this work, we close several gaps and resolve an open problem posed by Froese et al. [COLT '25] regarding the parameterized complexity of various problems related to network verification. In particular, we prove that deciding positivity (and thus surjectivity) of a function $f\colon\mathbb{R}^d\to\mathbb{R}$ computed by a 2-layer ReLU network is W[1]-hard when parameterized by $d$. This result also implies that zonotope (non-)containment is W[1]-hard with respect to $d$, a problem that is of independent interest in computational geometry, control theory, and robotics. Moreover, we show that (a) approximating the maximum within any multiplicative factor in 2-layer ReLU networks, (b) computing the $L_p$-Lipschitz constant for $p\in(0,\infty]$ in 2-layer networks, and (c) approximating the $L_p$-Lipschitz constant in 3-layer networks are all NP-hard and W[1]-hard with respect to $d$. Notably, our hardness results are the strongest known so far and imply that the naive enumeration-based methods for solving these fundamental problems are all essentially optimal under the Exponential Time Hypothesis.
Intrinsic Entropy of Context Length Scaling in LLMs
Jingzhe Shi ⋅ Qinwei (Martin) Ma ⋅ Hongyi Liu ⋅ Hang Zhao ⋅ Jenq-Neng Hwang ⋅ Lei Li
Long Context Language Models have drawn great attention in the past few years. Prior work has discussed the impact of long context on Language Model performance: some studies find that long irrelevant context can harm performance, while others experimentally summarize the loss reduction from relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose to use 'Intrinsic Entropy' to explain the impact of context length on language modeling; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling in certain cases. We hope our work may inspire new long context Language Models, as well as future work studying the physics of Language Models.
Implicit bias produces neural scaling laws in learning curves, from perceptrons to deep networks
Francesco DAmico ⋅ Dario Bocchi ⋅ Matteo Negri
Scaling laws in deep learning -- empirical power-law relationships linking model performance to resource growth -- have emerged as simple yet striking regularities across architectures, datasets, and tasks. These laws are particularly impactful in guiding the design of state-of-the-art models, since they quantify the benefits of increasing data or model size, and hint at the foundations of interpretability in machine learning. However, most studies focus on asymptotic behavior at the end of training. In this work, we describe a richer picture by analyzing the entire training dynamics: we identify two novel dynamical scaling laws that govern how performance evolves as a function of different norm-based complexity measures. Combined, our new laws recover the well-known scaling for test error at convergence. Our findings are consistent across CNNs, ResNets, and Vision Transformers trained on MNIST, CIFAR-10 and CIFAR-100. Furthermore, we provide analytical support using a single-layer perceptron trained with logistic loss, where we derive the new dynamical scaling laws, and we explain them through the implicit bias induced by gradient-based training.
Regulating Internal Alignment Flows for Robust Learning Under Spurious Correlations
Rajeev Ranjan Dwivedi ⋅ Mohammedkaif Mohammedrafiq Kalagond ⋅ Niramay Patel ⋅ Vinod Kurmi
Deep models often exploit spurious correlations (e.g., backgrounds or dataset artifacts), hurting worst-group performance. We propose Alignment-Gated Suppression (AGS), a lightweight, plug-in regularizer that intervenes inside the network during training. AGS tracks a class-conditional, confidence-weighted contribution for each neuron (more negative $\Leftrightarrow$ stronger support) and applies a percentile-based, multiplicative decay to the most extreme contributors, reducing overconfident shortcut pathways while leaving other features relatively more influential. AGS integrates with standard ERM, requires no group labels, and adds $<5\%$ training overhead. We provide analysis linking AGS to minority-margin gains, path-norm-like capacity control, and stability benefits via EMA-smoothed gating. Empirically, AGS improves worst-group accuracy and calibration vs. ERM and is competitive with state-of-the-art methods across spurious-correlation benchmarks (e.g., Waterbirds, CelebA, BAR, COCO), while maintaining strong average accuracy. These results suggest that regulating internal alignment flow is a simple and scalable route to robustness without group labels.
What Scales in Cross-Entropy Scaling Law?
Junxi Yan ⋅ Zixi Wei ⋅ Qingyao Ai ⋅ Yiqun LIU ⋅ Jingtao Zhan
The cross-entropy scaling law has long served as a key tool for guiding the development of large language models. It shows that cross-entropy loss decreases at a predictable power-law rate as the model size increases. However, recent evidence indicates that this law breaks down at very large scales: the loss decreases more slowly than expected, which causes significant trouble for developing large language models. In this paper, we hypothesize that the root cause is that cross-entropy itself does not truly scale; instead, only one of its hidden components does. To investigate this, we introduce a novel decomposition of cross-entropy into three parts: Error-Entropy, Self-Alignment, and Confidence. We show both theoretically and empirically that this decomposition precisely captures the training dynamics and optimization objectives. Through extensive experiments on multiple datasets and 32 models spanning five orders of magnitude in size, we find that only error-entropy follows a robust power-law scaling, while the other two terms remain largely invariant. Moreover, error-entropy constitutes the dominant share of cross-entropy in small models but diminishes in proportion as models grow larger. This explains why the cross-entropy scaling law appears accurate at small scales but fails at very large ones. Our findings establish the error-entropy scaling law as a more accurate description of model behavior. We believe it will have wide applications in the training, understanding, and future development of large language models.
Risk Phase Transitions in Spiked Regression: Alignment Driven Benign and Catastrophic Overfitting
Jiping Li ⋅ Rishi Sonthalia
This paper analyzes the generalization error of minimum-norm interpolating solutions in linear regression using spiked covariance data models. The paper characterizes how varying spike strengths and target-spike alignments can affect risk, especially in overparameterized settings. The study presents an exact expression for the generalization error, leading to a comprehensive classification of benign, tempered, and catastrophic overfitting regimes based on spike strength, the aspect ratio $c=d/n$ (particularly as $c \to \infty$), and target alignment. Notably, in well-specified aligned problems, increasing spike strength can surprisingly induce catastrophic overfitting before achieving benign overfitting. The paper also reveals that target-spike alignment is not always advantageous, identifying specific, sometimes counterintuitive, conditions for its benefit or detriment. Alignment with the spike being detrimental is empirically demonstrated to persist in nonlinear models.
Some Neural Networks Inherently Preserve Subspace Clustering Structure
Karan Vikyath Veeranna Rupashree ⋅ Siddharth Baskar ⋅ Daniel L Pimentel-Alarcón
It has long been conjectured and empirically observed that neural networks tend to preserve clustering structure. This paper formalizes this conjecture. Specifically, we establish precise conditions for cluster structure preservation and derive bounds to quantify its extent. Through this analysis we are able to show that certain neural networks are learning parameters that preserve the clustering structure of the original data in their embeddings, without the need to impose mechanisms to promote this behavior. Extensive numerical analysis and experiments validate our results. Our findings offer deeper insight into neural network behavior, explaining why certain data types (such as images, audio, and text) benefit more from deep learning. Beyond theory, our findings guide better initialization, feature encoding, and regularization strategies.
Learning on a Razor’s Edge: Identifiability and Singularity of Polynomial Neural Networks
Vahid Shahverdi ⋅ Giovanni Luca Marchetti ⋅ Kathlén Kohn
We study function spaces parametrized by neural networks, referred to as neuromanifolds. Specifically, we focus on deep Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) with an activation function that is a sufficiently generic polynomial. First, we address the identifiability problem, showing that, for almost all functions in the neuromanifold of an MLP, there exist only finitely many parameter choices yielding that function. For CNNs, the parametrization is generically one-to-one. As a consequence, we compute the dimension of the neuromanifold. Second, we describe singular points of neuromanifolds. We characterize singularities completely for CNNs, and partially for MLPs. In both cases, they arise from sparse subnetworks. For MLPs, we prove that these singularities often correspond to critical points of the mean-squared error loss, which does not hold for CNNs. This provides a geometric explanation of the sparsity bias of MLPs. All of our results leverage tools from algebraic geometry.
Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
Vage Egiazarian ⋅ Roberto Castro ⋅ Denis Kuznedelev ⋅ Andrei Panferov ⋅ Eldar Kurtic ⋅ Shubhra Pandit ⋅ Alexandre Marques ⋅ Mark Kurtz ⋅ Saleh Ashkboos ⋅ Torsten Hoefler ⋅ Dan Alistarh
The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4, due to two key issues: (1) NVFP4's small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4's power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4's unique properties, by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead, by rotation fusion into the weights, and fast online computation of the activations. This leads to speedups vs. FP16 of up to 3.6x layer-wise, and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4, to the point where it nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
Benchmarking LLM Tool-Use in the Wild
Peijie Yu ⋅ Wei Liu ⋅ Yifan Yang ⋅ Jinjian Li ⋅ Zelong Zhang ⋅ Xiao Feng ⋅ feng zhang
Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behaviour: compositional tasks that demand efficient orchestration of tool-call topologies, implicit intent spread across dialogue turns that requires contextual inference, and instruction transition, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy of more than 15%, indicating a substantial gap in the robustness of LLMs' agentic ability. Controlled experiments and in-depth analyses further indicate that the real challenge for LLM tool-use lies not in artificially complex tasks, but in the wild nature of user behavior, emphasizing the need to reconsider the interactions among LLMs, users, and tools.
On The Surprising Effectiveness of a Single Global Merging in Decentralized Learning
Tongtian Zhu ⋅ Tianyu Zhang ⋅ Mingze Wang ⋅ Zhanpeng Zhou ⋅ Can Wang
Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which was previously considered detrimental noise, as a constructive component essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.
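As a rough illustration of the difference between peer-to-peer averaging and a single global merging, the sketch below treats each local model as a flat list of parameters. It assumes that merging is a uniform average over all nodes and that gossip uses equal weights over a node and its neighbors. Both are assumptions for illustration, and the ring topology and values are hypothetical.

```python
def gossip_step(models, neighbors):
    """One decentralized averaging step: each node averages itself with its neighbors."""
    updated = []
    for i, own in enumerate(models):
        group = [own] + [models[j] for j in neighbors[i]]
        updated.append([sum(vals) / len(group) for vals in zip(*group)])
    return updated


def global_merge(models):
    """Single global merging: uniform average of all local parameter vectors."""
    n = len(models)
    return [sum(vals) / n for vals in zip(*models)]


# Hypothetical 4-node ring, one scalar parameter per node.
models = [[0.0], [4.0], [8.0], [12.0]]
ring = {0: [3, 1], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

after_gossip = gossip_step(models, ring)   # local models still disagree
merged = global_merge(after_gossip)        # one fully connected averaging step
print(after_gossip, merged)
```

After the gossip step the local models remain distinct, while the merge collapses them to a single model in one round of fully connected communication.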
Inductive Reasoning for Temporal Knowledge Graphs with Emerging Entities
Ze Zhao ⋅ Yuhui He ⋅ Lyuwen Wu ⋅ Gu Tang ⋅ Bin Lu ⋅ Xiaoying Gan ⋅ Luoyi Fu ⋅ Xinbing Wang ⋅ Chenghu Zhou
Reasoning on Temporal Knowledge Graphs (TKGs) is essential for predicting future events and time-aware facts. While existing methods are effective at capturing relational dynamics, their performance is limited by a closed-world assumption, which fails to account for emerging entities that are absent from the training data. Notably, these entities continuously join the network without historical interactions. Our empirical study reveals that emerging entities are widespread in TKGs, comprising roughly 25\% of all entities. The absence of historical interactions for these entities leads to significant performance degradation in reasoning tasks. However, we observe that entities with semantic similarities often exhibit comparable interaction histories, suggesting the presence of transferable temporal patterns. Inspired by this insight, we propose TransFIR (Transferable Inductive Reasoning), a novel framework that leverages historical interaction sequences from semantically similar known entities to support inductive reasoning. Specifically, we propose a codebook-based classifier that categorizes emerging entities into latent semantic clusters, allowing them to adopt reasoning patterns from similar entities. Experimental results demonstrate that TransFIR outperforms all baselines in reasoning on emerging entities, achieving an average improvement of 28.6\% in Mean Reciprocal Rank (MRR) across multiple datasets. The implementations are available at https://anonymous.4open.science/r/TransFIR-C72F.
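For readers unfamiliar with the reported metric, Mean Reciprocal Rank averages the reciprocal of the rank at which the correct entity appears for each query. The sketch below uses hypothetical ranks and is not tied to the TransFIR evaluation code.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over queries; ranks are 1-based positions of the correct answer."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the true entity for three queries.
example_ranks = [1, 2, 4]
print(mean_reciprocal_rank(example_ranks))
```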
Composer: A Search Framework for Hybrid Neural Architecture Design
Bilge Acun ⋅ Prasoon Sinha ⋅ Newsha Ardalani ⋅ Sangmin Bae ⋅ Alicia Golden ⋅ Meghana Madhyastha ⋅ Chien-Yu Lin ⋅ Fei Sun ⋅ Neeraja Yadwadkar ⋅ Carole-Jean Wu
Hybrid model architectures that combine computational primitives (e.g., Attention, MLP) in different ratios have shown promising performance beyond Transformers. Some studies have shown that different interleavings of primitives can affect model quality as well. However, prior works explore the hybrid model architecture design space manually. Due to the large design space and training costs, discovering hybrid models that combine key computational primitives for pre-training is challenging. In this work, we take a principled approach in designing a modular hybrid model architecture search framework — Composer. Composer explores model architectures at a small scale and extrapolates the top-performing model architectures to a larger scale using our proposed scaling strategies. Using Composer, we discover new hybrid LLM architectures that outperform Llama 3.2. Compared to Llama 3.2 and previous state-of-the-art baselines, the new model architectures consistently reduce validation loss at parameter scales of 350M-8B and improve evaluation accuracy on the downstream tasks by up to 2-2.1% on average while improving both training and inference efficiency.
Never Saddle for Reparameterized Steepest Descent as Mirror Flow
Tom Jacobs ⋅ Chao Zhou ⋅ Rebekka Burkholz
How does the choice of optimization algorithm shape a model’s ability to learn features? To address this question for steepest descent methods (including sign descent, which is closely related to Adam), we introduce steepest mirror flows as a unifying theoretical framework. This framework reveals how optimization geometry governs learning dynamics, implicit bias, and sparsity, and it provides two explanations for why Adam and AdamW often outperform SGD in fine-tuning. Focusing on diagonal linear networks and deep diagonal linear reparameterizations (a simplified proxy for attention), we show that steeper descent facilitates both saddle-point escape and feature learning. In contrast, gradient descent requires unrealistically large learning rates to escape saddles, an uncommon regime in fine-tuning. Empirically, we confirm that saddle-point escape is a central challenge in fine-tuning. Furthermore, we demonstrate that decoupled weight decay, as in AdamW, stabilizes feature learning by enforcing novel balance equations. Together, these results highlight two mechanisms by which steepest descent can aid modern optimization.
Learning Semi-Structured Sparsity for LLMs via Shared and Context-Aware Hypernetwork
Lu Sun ⋅ Jun Sakuma
Large Language Models (LLMs) achieve state-of-the-art performance but are costly to deploy in resource-constrained environments. Pruning with $n:m$ semi-structured sparsity reduces computation and enables hardware acceleration, yet existing methods face a trade-off: one-shot approaches are efficient but heuristic, while optimization-based methods are accurate but expensive. We introduce \textbf{HyperPrune}, a resource-efficient framework that directly optimizes $n:m$ sparsity. A lightweight hypernetwork, shared across layers and conditioned on learnable embeddings, generates structured masks in a one-shot, layer-wise manner. \textit{Continual pruning} preserves cross-layer knowledge, and \textit{feature outlier regularization} retains critical activations, unifying the strengths of heuristic and optimization-based methods. Experiments on LLaMA-7B to 70B show state-of-the-art accuracy–sparsity trade-offs on a single A100 GPU, achieving higher efficiency, accuracy, and scalability than prior approaches. HyperPrune offers a practical, scalable, and hardware-friendly solution for structured LLM pruning.
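To make the $n:m$ constraint concrete, the sketch below builds a mask that keeps the $n$ largest-magnitude weights in every group of $m$ consecutive weights. This is the simple one-shot magnitude heuristic, shown only to illustrate the sparsity pattern. It is not HyperPrune's hypernetwork, and the function name and example weights are hypothetical.

```python
def nm_mask(weights, n=2, m=4):
    """Return a 0/1 mask keeping the n largest-magnitude entries in each group of m."""
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


# Hypothetical row of weights pruned to 2:4 sparsity.
row = [0.1, -0.9, 0.5, 0.05, 2.0, -0.1, 0.0, 0.3]
mask = nm_mask(row, n=2, m=4)
pruned = [w * k for w, k in zip(row, mask)]
print(mask, pruned)
```

Because exactly $n$ entries survive in each group, the pattern is regular enough for hardware that supports semi-structured sparsity, which is what distinguishes it from unstructured pruning.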
FoNE: Precise Single-Token Number Embeddings via Fourier Features
Tianyi Zhou ⋅ Deqing Fu ⋅ Mahdi Soltanolkotabi ⋅ Robin Jia ⋅ Vatsal Sharan
Language models treat numbers in the same way as ordinary word tokens, which introduces two major issues: (1) embeddings of numerical tokens primarily reflect their frequency in text corpora rather than their inherent numerical properties, leading to frequency bias, and (2) numbers are often split into multiple tokens, forcing the model to aggregate these pieces to recover their values. Inspired by the observation that pre-trained Large Language Models (LLMs) internally learn Fourier-like features for number tokens, we propose **Fo**urier **N**umber **E**mbedding **(FoNE)**, a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. Compared to traditional subword and digit-wise embeddings, FoNE achieves higher accuracy on arithmetic tasks, requires significantly less training data, and offers more efficient training and inference. A $38$M-parameter Transformer trained from scratch with FoNE outperforms a fine-tuned Llama-3.2-1B model on addition, subtraction, and multiplication. FoNE is also the only method that achieves $100\%$ accuracy on over 100,000 test examples across these tasks. On 6-digit decimal addition, FoNE needs 64$\times$ less data than subword and digit-wise embeddings to reach $\ge 99\%$ accuracy, while using 3$\times$ and 6$\times$ fewer tokens per number, respectively.
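The idea of two embedding dimensions per digit can be sketched with one cosine/sine pair per digit position. The sketch assumes periods of 10, 100, 1000, and so on, so that the $i$-th pair encodes the number modulo $10^i$. The choice of periods, the function names, and the decoding routine are assumptions made for illustration, restricted to non-negative integers, and are not taken from the abstract.

```python
import math


def fone_encode(x, num_digits):
    """Map a non-negative integer to [cos, sin] pairs, one pair per digit position."""
    features = []
    for i in range(1, num_digits + 1):
        period = 10 ** i  # assumed period for the i-th digit
        angle = 2.0 * math.pi * (x % period) / period
        features += [math.cos(angle), math.sin(angle)]
    return features


def fone_decode(features):
    """Recover the integer digit by digit from the [cos, sin] pairs."""
    value = 0
    for i in range(1, len(features) // 2 + 1):
        c, s = features[2 * (i - 1)], features[2 * (i - 1) + 1]
        frac = (math.atan2(s, c) / (2.0 * math.pi)) % 1.0  # fraction of a full turn
        residue = round(frac * 10 ** i) % 10 ** i          # x mod 10^i
        digit = residue // 10 ** (i - 1)                   # i-th digit from the right
        value += digit * 10 ** (i - 1)
    return value


# A 6-digit number becomes a single 12-dimensional vector.
vector = fone_encode(123456, num_digits=6)
print(len(vector), fone_decode(vector))
```

Each pair lies on the unit circle, so every digit occupies the same numeric range regardless of the magnitude of the number, which is one way a single token can carry the full value.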
BoRA: Towards More Expressive Low-Rank Adaptation with Block Diversity
Shiwei Li ⋅ Xiandi Luo ⋅ Haozhao Wang ⋅ Xing Tang ⋅ Ziqiang Cui ⋅ Dugang Liu ⋅ Yuhua Li ⋅ Yichen Li ⋅ xiuqiang He ⋅ Ruixuan Li
Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method widely used in large language models (LLMs). It approximates the update of a pretrained weight matrix $W\in\mathbb{R}^{m\times n}$ by the product of two low-rank matrices, $BA$, where $A \in\mathbb{R}^{r\times n}$ and $B\in\mathbb{R}^{m\times r} (r\ll\min\{m,n\})$. Increasing the dimension $r$ can raise the rank of LoRA weights (i.e., $BA$), which typically improves fine-tuning performance but also significantly increases the number of trainable parameters. In this paper, we propose **Block Diversified Low-Rank Adaptation (BoRA)**, which improves the rank of LoRA weights with a small number of additional parameters. Specifically, BoRA treats the product $BA$ as a block matrix multiplication, where $A$ and $B$ are partitioned into $b$ blocks along the columns and rows, respectively (i.e., $A=[A_1,\dots,A_b]$ and $B=[B_1,\dots,B_b]^\top$). Consequently, the product $BA$ becomes the concatenation of the block products $B_iA_j$ for $i,j\in[b]$. To enhance the diversity of different block products, BoRA introduces a unique diagonal matrix $\Sigma_{i,j} \in \mathbb{R}^{r\times r}$ for each block multiplication, resulting in $B_i \Sigma_{i,j} A_j$. By leveraging these block-wise diagonal matrices, BoRA increases the rank of LoRA weights by a factor of $b$ while only requiring $b^2r$ additional parameters. Extensive experiments across multiple datasets and models demonstrate the superiority of BoRA, and ablation studies further validate its scalability.
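The rank claim can be checked on a toy case. The sketch below assembles the block matrix whose $(i,j)$ block is $B_i \Sigma_{i,j} A_j$ and compares its rank with that of plain $BA$ (all $\Sigma_{i,j}$ set to the identity). It uses exact rational arithmetic from the standard library. The sizes ($m=n=4$, $r=1$, $b=2$), the integer entries, and the helper names are hypothetical choices for illustration.

```python
from fractions import Fraction


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matrix_rank(M):
    """Exact rank via Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in M]
    rank, rows, cols = 0, len(M), len(M[0])
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r][c] != 0), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, rows):
            factor = M[r][c] / M[rank][c]
            M[r] = [a - factor * p for a, p in zip(M[r], M[rank])]
        rank += 1
    return rank


def bora_weight(B_blocks, A_blocks, sigma):
    """Assemble the matrix whose (i, j) block is B_i @ diag(sigma[i][j]) @ A_j."""
    out = []
    for i, B_i in enumerate(B_blocks):
        row_blocks = []
        for j, A_j in enumerate(A_blocks):
            # diag(sigma[i][j]) @ A_j scales the k-th row of A_j by sigma[i][j][k].
            scaled = [[s * a for a in a_row] for s, a_row in zip(sigma[i][j], A_j)]
            row_blocks.append(matmul(B_i, scaled))
        for k in range(len(B_i)):
            out.append([v for block in row_blocks for v in block[k]])
    return out


# Toy setting: m = n = 4, r = 1, b = 2.
B_blocks = [[[1], [2]], [[3], [1]]]   # B_1, B_2, each (m/b) x r
A_blocks = [[[1, 2]], [[0, 1]]]       # A_1, A_2, each r x (n/b)
identity = [[[1], [1]], [[1], [1]]]   # plain LoRA: every Sigma_{i,j} is the identity
diverse = [[[1], [2]], [[3], [1]]]    # b^2 * r = 4 extra diagonal parameters

plain = bora_weight(B_blocks, A_blocks, identity)
boosted = bora_weight(B_blocks, A_blocks, diverse)
print(matrix_rank(plain), matrix_rank(boosted))
```

In this toy case the block-wise diagonals raise the rank from $r$ to $br$ while adding only $b^2 r$ scalars, matching the counting argument in the abstract. In general the achieved rank depends on the values of $\Sigma_{i,j}$.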
Compositional Generalization from Learned Skills via CoT Training: A Theoretical and Structural Analysis for Reasoning
Xinhao Yao ⋅ Ruifeng Ren ⋅ Yun Liao ⋅ Lizhong Ding ⋅ Yong Liu
Chain-of-Thought (CoT) training has markedly advanced the reasoning capabilities of large language models (LLMs), yet the mechanisms by which CoT training enhances generalization remain inadequately understood. In this work, we demonstrate that compositional generalization is fundamental: models systematically combine simpler learned skills during CoT training to address novel and more complex problems. Through a theoretical and structural analysis, we formalize this process: 1) Theoretically, the information-theoretic generalization bounds through distributional divergence can be decomposed into in-distribution (ID) and out-of-distribution (OOD) components. Specifically, the non-CoT models fail on OOD tasks due to unseen compositional patterns, whereas CoT-trained models achieve strong generalization by composing previously learned skills. In addition, controlled experiments and real-world validation confirm that CoT training accelerates convergence and enhances generalization from ID to both ID and OOD scenarios while maintaining robust performance even with tolerable noise. 2) Structurally, CoT training internalizes reasoning into a two-stage compositional circuit, where the number of stages corresponds to the explicit reasoning steps during training. Notably, CoT-trained models resolve intermediate results at shallower layers compared to non-CoT counterparts, freeing up deeper layers to specialize in subsequent reasoning steps. A key insight is that CoT training teaches models how to think—by fostering compositional reasoning—rather than merely what to think, through the provision of correct answers alone. This paper offers valuable insights for designing CoT strategies to enhance LLMs' reasoning robustness.
Understanding the Learning Phases in Self-Supervised Learning via Critical Periods
JangHyeon Lee ⋅ Philipe Dias ⋅ Yao-Yi Chiang ⋅ Dalton Lunga
Self-supervised learning (SSL) has emerged as a powerful pretraining strategy to learn transferable representations from unlabeled data. Yet, it remains unclear how long SSL models should be pretrained to yield such representations. Contrary to the prevailing heuristic that longer pretraining translates to better downstream performance, we observe a transferability trade-off: across diverse SSL settings, intermediate checkpoints can yield stronger out-of-domain (OOD) generalization, whereas additional pretraining primarily benefits in-domain (ID) performance. From this observation, we hypothesize that SSL progresses through learning phases that can be characterized via the lens of critical periods (CP). Prior work on CP has shown that supervised models exhibit an early phase of high plasticity, followed by a consolidation phase where adaptability declines but task-specific performance increases. Since traditional CP analysis was developed for supervised settings, we rethink it for SSL in two ways. First, we inject deficits to perturb the pretraining data and assess their lasting impact on representation quality via downstream tasks. Second, we compute the Fisher Information on pretext objectives to track plasticity, quantifying how sensitive model parameters are to the pretext task. Our experiments suggest that SSL models may exhibit their own CP, with CP closure coinciding with a sweet spot for broad downstream transferability. Leveraging these insights, we introduce CP-guided checkpoint selection as a strategy for selecting checkpoints that offer stronger OOD transferability. Finally, to balance the transferability trade-off, we present CP-guided self-distillation, which selectively distills layer representations from the intermediate checkpoint into their overspecialized counterparts in the final checkpoint.
Training deep neural networks remains costly in terms of data, time, and energy, limiting their deployment in large-scale and resource-constrained settings. To address this, we propose Batch Pruning by Activation Stability (B-PAS), a dynamic plug-in strategy that accelerates training by removing batches that contribute less to learning. B-PAS monitors the stability of activation representations across epochs and prunes batches whose activation variance exhibits minimal change, indicating diminishing learning utility. Applied to ResNet-18, ResNet-50, and the Convolutional vision Transformer (CvT) on CIFAR-10, CIFAR-100, SVHN, and ImageNet-1K, B-PAS reduces training batch usage by up to 57\% with no loss in accuracy, and by 47\% while slightly improving accuracy. Moreover, it achieves up to 61\% savings in GPU node-hours, outperforming prior state-of-the-art pruning methods with up to 29\% higher data savings and 21\% greater GPU node-hour savings. We further demonstrate the generalization of B-PAS by extending it to GPT-2 fine-tuning, showing that activation stability can serve as an effective pruning signal beyond vision models. These results highlight activation stability as a powerful internal signal for efficient training, offering a practical and sustainable path toward data and energy-efficient deep learning.
Can Language Models Discover Scaling Laws?
Haowei Lin ⋅ Haotian Ye ⋅ Wenzheng Feng ⋅ Quzhe Huang ⋅ Yujun Li ⋅ Hubert Lim ⋅ Zhengrui Li ⋅ Xiangyu Wang ⋅ Jianzhu Ma ⋅ Yitao Liang ⋅ James Y Zou
Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case-specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate seven diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimizes the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrate that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior, and can contribute novel and practical knowledge back to the research community.
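For context, a common human-derived baseline is a pure power law $L(N) = a N^{-\alpha}$, fitted by least squares in log-log space and then extrapolated to larger $N$. The sketch below shows that baseline only. The functional form, the synthetic data, and the function names are assumptions for illustration and are unrelated to the laws SLDAgent discovers.

```python
import math


def fit_power_law(sizes, losses):
    """Fit L(N) = a * N**(-alpha) by ordinary least squares on (log N, log L)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x, mean_y = sum(xs) / count, sum(ys) / count
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)


def predict(a, alpha, n):
    """Evaluate the fitted law at model size n."""
    return a * n ** (-alpha)


# Synthetic, noise-free data generated from a = 5.0, alpha = 0.3.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.3 for n in sizes]
a, alpha = fit_power_law(sizes, losses)
print(a, alpha, predict(a, alpha, 1e10))
```

Choosing the functional form is the hard part that the abstract targets. Once a form is fixed, fitting its parameters is comparatively routine.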
Rejuvenating Cross-Entropy Loss in Knowledge Distillation for Recommender Systems
Zhangchi Zhu ⋅ Wei Zhang
This paper analyzes Cross-Entropy (CE) loss in knowledge distillation (KD) for recommender systems. KD for recommender systems aims to distill rankings, especially among items most likely to be preferred, and can only be computed on a small subset of items. Considering these features, we reveal the connection between CE loss and NDCG in the field of KD. We prove that when performing KD on an item subset, minimizing CE loss maximizes the lower bound of NDCG only if an assumption of closure is satisfied. This assumption requires that the item subset consist of the student's top items. However, this contradicts our goal of distilling rankings of the teacher's top items. We empirically demonstrate the vast gap between these two kinds of top items. To bridge the gap between our goal and theoretical support, we propose Rejuvenated Cross-Entropy for Knowledge Distillation (RCE-KD). It splits the top items given by the teacher into two subsets based on whether they are highly ranked by the student. For the subset that defies the condition, a sampling strategy is devised to use teacher-student collaboration to approximate our assumption of closure. We also combine the losses on the two subsets adaptively. Extensive experiments demonstrate the effectiveness of our method. Our code is available at https://github.com/BDML-lab/RCE-KD.
Beyond Uniformity: Regularizing Implicit Neural Representations through a Lipschitz Lens
Julian McGinnis ⋅ Suprosanna Shit ⋅ Florian A. Hölzl ⋅ Paul Friedrich ⋅ Paul Büschl ⋅ Vasiliki Sideri-Lampretsa ⋅ Mark Mühlau ⋅ Philippe Cattin ⋅ Bjoern Menze ⋅ Daniel Rueckert ⋅ Benedikt Wiestler
Implicit Neural Representations (INRs) have shown great promise in solving inverse problems, but their lack of inherent regularization often leads to a trade-off between expressiveness and smoothness. While Lipschitz continuity presents a principled form of implicit regularization, it is often applied as a rigid, uniform 1-Lipschitz constraint, limiting its potential in inverse problems. In this work, we reframe Lipschitz regularization as a flexible *Lipschitz budget framework*. We propose a method to first derive a principled, task-specific total budget $K$, then proceed to distribute this budget *non-uniformly* across all network components, including linear weights, activations, and embeddings. Across extensive experiments on deformable registration and image inpainting, we show that non-uniform allocation strategies provide a measure to balance regularization and expressiveness within the specified global budget. Our *Lipschitz lens* introduces an alternative, interpretable perspective to Neural Tangent Kernel (NTK) and Fourier analysis frameworks in INRs, offering practitioners actionable principles for improving network architecture and performance. Code and experimental results are available at: [https://lipschitz-inrs.github.io](https://lipschitz-inrs.github.io).
On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond
Chenxiao Yang ⋅ Cai Zhou ⋅ David Wipf ⋅ Zhiyuan Li
Diffusion language models have recently emerged as a competitive alternative to autoregressive language models. Beyond next-token generation, they are more efficient and flexible by enabling parallel and any-order token generation. However, despite empirical successes, their computational power and fundamental limitations remain poorly understood. In this paper, we formally study whether non-autoregressive generation in Masked Diffusion Models (MDM) enables solving problems beyond the reach of Auto-Regressive Models (ARM). Our results show that MDM with sufficiently large context length is computationally universal with decoding steps matching the optimal parallel time complexity in PRAM. However, when controlling for other factors, MDM's flexibility to generate in any order does not expand what ARM can already solve. To address this, we propose a new form of generation called any-process generation, which extends MDM with capabilities to remask, insert, and delete tokens, allowing self-correction, length-variable editing, and adaptive parallelism. Theoretically and empirically, we demonstrate that these capabilities enable scalability to significantly harder reasoning problems that are otherwise intractable for ARM and vanilla MDM. Additionally, they prove essential for generation tasks where objects naturally evolve through non-sequential processes, which is crucial for extending current LLMs beyond natural language to domains such as coding and science.
Train-before-Test Harmonizes Language Model Rankings
Guanhua Zhang ⋅ Ricardo Dominguez-Olmedo ⋅ Moritz Hardt
Existing language model benchmarks provide contradictory model rankings, even for benchmarks that aim to capture similar skills. This dilemma of conflicting rankings hampers model selection, clouds model comparisons, and adds confusion to a growing ecosystem of competing models. In this paper, we take a different perspective on model comparison: instead of relying on out-of-the-box performance via direct evaluation, we compare model potential by providing each model with identical benchmark-specific fine-tuning before evaluation. We call this approach train-before-test. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained through train-before-test exhibit remarkable consistency across all benchmarks. Whereas traditional rankings demonstrate little external validity under direct evaluation, they enjoy a significant degree of external validity when applying train-before-test: model potential rankings transfer gracefully from one benchmark to another. Second, train-before-test restores the connection between perplexity and downstream task performance, lost under direct evaluation. Remarkably, even pre-finetuning perplexity of a base model predicts post-finetuning downstream performance, suggesting that ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by one latent factor, uncovered by train-before-test. While direct evaluation remains useful for assessing deployment-ready performance, train-before-test provides a complementary lens for understanding achievable performance of models after adaptation.
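Agreement between the model rankings produced by two benchmarks can be quantified with a rank correlation. The sketch below uses Kendall's tau (the tau-a variant, which ignores ties in the denominator) as one common choice. The abstract does not state which statistic the authors use, and the scores shown are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same models."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two models")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1   # both benchmarks order the pair the same way
        elif sign < 0:
            discordant += 1   # the benchmarks disagree on this pair
    pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical scores of four models on two benchmarks.
benchmark_x = [0.61, 0.55, 0.72, 0.40]
benchmark_y = [0.33, 0.35, 0.48, 0.20]
print(kendall_tau(benchmark_x, benchmark_y))
```

A value near 1 means the two benchmarks rank models almost identically, while a value near 0 corresponds to the contradictory rankings the abstract describes under direct evaluation.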
Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR
Hao Yi ⋅ Yulan Hu ⋅ Xin Li ⋅ Sheng Ouyang ⋅ Lizhong Ding ⋅ Yong Liu
Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting, due to ignoring \textbf{objective uncertainty} when only selecting by subjective uncertainty. This work proposes an \textbf{uncertainty consistency} metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, because of limited sampling and dynamically shifting output distributions, PBC estimation is difficult. Therefore, we introduce a new online variant, computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30\% of the data, effectively reducing the cost of RLVR for reasoning tasks. The code is available at https://github.com/yihao-123/uncertainty-consistency.
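The Point-Biserial Correlation Coefficient measures association between a binary variable and a continuous one. The sketch below uses the standard formula with the population standard deviation. Pairing a 0/1 correctness label with a per-sample uncertainty score is an assumed reading of the offline setting, and the data are hypothetical.

```python
import math


def point_biserial(binary, values):
    """Point-biserial correlation between 0/1 labels and continuous values."""
    if len(binary) != len(values):
        raise ValueError("inputs must have the same length")
    ones = [v for b, v in zip(binary, values) if b == 1]
    zeros = [v for b, v in zip(binary, values) if b == 0]
    if not ones or not zeros:
        raise ValueError("both classes must be present")
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population SD
    if std == 0.0:
        raise ValueError("values must not be constant")
    mean_one = sum(ones) / len(ones)
    mean_zero = sum(zeros) / len(zeros)
    return (mean_one - mean_zero) / std * math.sqrt(len(ones) * len(zeros) / n ** 2)


# Hypothetical samples: 1 = verified correct, 0 = incorrect, with an uncertainty score each.
correct = [1, 1, 0, 0, 1, 0]
uncertainty = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]
print(point_biserial(correct, uncertainty))
```

A strongly negative value here would mean that samples the model is unsure about tend to be the ones it gets wrong, that is, subjective and objective uncertainty agree.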
Recent findings raise concerns about whether the evaluation of Multiple-Choice Question Answering (MCQA) accurately reflects the comprehension abilities of large language models. This paper explores the concept of \textit{choice sensitivity}, which refers to the tendency for model decisions to be more influenced by the answer options than by a genuine understanding of the question. We introduce a new scoring method called Normalized Probability Shift by the Question (NPSQ), designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbols, and hybrid formats, we find that traditional scoring methods — such as those based on log-likelihood or its length-normalized variant — are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when modifications are made to the answer options.
E²LoRA: Efficient and Effective Low-Rank Adaptation with Entropy-Guided Adaptive Sharing
Minglei Li ⋅ Peng Ye ⋅ Jingqi Ye ⋅ Haonan He ⋅ Tao Chen
As large pre-trained models rapidly scale, Parameter-Efficient Fine-Tuning (PEFT) through methods like Low-Rank Adaptation (LoRA) becomes increasingly crucial. While LoRA has emerged as a cornerstone of PEFT, excelling at preserving performance with minimal additional parameters, exploring parameter-sharing mechanisms of LoRA remains critical to pushing efficiency boundaries. However, existing naive LoRA sharing methods often degrade performance due to sacrificed representational diversity and weakened model expressiveness. To overcome this issue, we conduct an in-depth analysis of pre-trained models using gradient-based proxy entropy, and uncover two critical, previously overlooked properties: Local Similarity and Layer-wise Information Heterogeneity. Building on these insights, we propose E²LoRA, a novel dual-adaptive sharing framework. It enables adaptive sharing interval partitioning, guided by inter-layer proxy entropy similarity, and adaptive rank allocation, informed by layer-wise absolute proxy entropy. This unique design leverages inherently informative properties of pre-trained models to significantly reduce parameter redundancy while maintaining or enhancing expressiveness. Comprehensive evaluations across diverse tasks, modalities, and models consistently demonstrate that E²LoRA achieves an excellent balance of efficiency and effectiveness, consistently matching or surpassing baselines with approximately 50% fewer trainable parameters.
DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems
Meiru Zhang ⋅ Philipp Borchert ⋅ Milan Gritta ⋅ Gerasimos Lampouras
Automating the formalization of mathematical statements for theorem proving remains a major challenge for Large Language Models (LLMs). LLMs struggle to identify and utilize the prerequisite mathematical knowledge and its corresponding formal representation in languages like Lean. Current retrieval-augmented autoformalization methods query external libraries using the informal statement directly, but overlook a fundamental limitation: informal statements lack direct mappings to mathematical theorems and lemmata, nor do those theorems translate trivially into the formal primitives of languages like Lean. To address this, we introduce DRIFT, a novel framework that enables LLMs to decompose informal mathematical statements into smaller, more tractable "sub-components". This facilitates targeted retrieval of premises from mathematical libraries such as Mathlib. Additionally, DRIFT retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval, nearly doubling the F1 score compared to the DPR baseline on ProofNet. Notably, DRIFT demonstrates strong performance on the out-of-distribution ConNF benchmark, with BEq+@10 improvements of 42.25% and 37.14% using GPT-4.1 and DeepSeek-V3.1, respectively. Our analysis shows that retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model's capabilities.
Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification
Lukas Rauch ⋅ René Heinrich ⋅ Houtan Ghaffari ⋅ Lukas Miklautz ⋅ Ilyass Moummad ⋅ Bernhard Sick ⋅ Christoph Scholz
Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning when pursuing state-of-the-art on AudioSet. A key reason is that global pooling creates an information bottleneck, causing linear probes to misrepresent the embedding quality: the $\texttt{cls}$-token discards crucial token information about dispersed, localized events in audio. This weakness is rooted in the mismatch between the pretraining objective (global) and the downstream task (localized). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we investigate the global pooling bottleneck. We introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.
Soft Tokens, Hard Truths
Natasha Butt ⋅ Ariel Kwiatkowski ⋅ Ismail Labiad ⋅ Julia Kempe ⋅ Yann Ollivier
The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use "soft" tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs matches discrete-token CoTs for pass@$1$ and surpasses them for pass@$32$, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens and then use discrete tokens for inference, meaning the "soft" models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model.
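A "soft" token of the kind described can be sketched as a probability-weighted mixture of token embeddings with Gaussian noise added to the result. The softmax temperature, the noise model, the function names, and the tiny two-token vocabulary below are all assumptions for illustration. The abstract does not specify these details.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    peak = max(logits)
    exps = [math.exp((l - peak) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(logits, embeddings, temperature=1.0, noise_std=0.0, rng=None):
    """Mix token embeddings by their probabilities, then add Gaussian noise."""
    rng = rng or random.Random(0)
    probs = softmax(logits, temperature)
    dim = len(embeddings[0])
    mixed = [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]
    return [v + rng.gauss(0.0, noise_std) for v in mixed]


# Hypothetical two-token vocabulary with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
clean = soft_token([0.0, 0.0], embeddings)                          # equal mixture, no noise
noisy = soft_token([2.0, 0.0], embeddings, noise_std=0.1,
                   rng=random.Random(42))                           # noise for exploration
print(clean, noisy)
```

Feeding such a vector as the next input embedding replaces the usual sample-then-embed step, and the injected noise plays the role that token sampling plays in discrete RL exploration.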
Qronos: Correcting the Past by Shaping the Future... in Post-Training Quantization
Shihao Zhang ⋅ Haoyu Zhang ⋅ Ian Colbert ⋅ Rayan Saab
We introduce Qronos, a new post-training quantization algorithm that not only explicitly corrects errors due to both weight and activation quantization, but also corrects errors accumulated from previously quantized layers. Our iterative algorithm is based on an interpretable and disciplined optimization framework that surpasses existing data-driven approaches. At each step, Qronos alternates between error correction and diffusion via optimal update rules. Importantly, we prove that Qronos admits an equivalent formulation that significantly improves algorithmic efficiency; we use our discovery to reduce peak memory usage by $18\times$ on Llama3 8B, and our scaling analysis shows a speedup of up to $13.8\times$ for a single-layer microbenchmark. We demonstrate compatibility with existing transformation techniques such as Hadamard-based incoherence processing and weight-activation scaling equalization, among others. We evaluate Qronos using recent language models in the Llama3 and Qwen3 families; Qronos consistently outperforms previous state-of-the-art adaptive rounding methods when quantizing the weights, activations, and/or KV caches to 4 bits or fewer.
Social Agents: Collective Intelligence Improves LLM Predictions
Aanisha Bhattacharyya ⋅ Abhilekh Borah ⋅ Yaman Singla ⋅ Rajiv Ratn Shah ⋅ Changyou Chen ⋅ Balaji Krishnamurthy
In human society, collective decision making has often outperformed the judgment of individuals. Classic examples range from estimating livestock weights to predicting elections and financial markets, where averaging many independent guesses often yields results more accurate than experts. These successes arise because groups bring together diverse perspectives, independent voices, and distributed knowledge, combining them in ways that cancel individual biases. This principle, known as the Wisdom of Crowds, underpins practices in forecasting, marketing, and preference modeling. Large Language Models (LLMs), however, typically produce a single definitive answer. While effective in many settings, this uniformity overlooks the diversity of human judgments shaping responses to ads, videos, and webpages. Inspired by how societies benefit from diverse opinions, we ask whether LLM predictions can be improved by simulating not one answer but many. We introduce Social Agents, a multi-agent framework that instantiates a synthetic society of human-like personas with diverse demographic (e.g., age, gender) and psychographic (e.g., values, interests) attributes. Each persona independently appraises a stimulus such as an advertisement, video, or webpage, offering both a quantitative score (e.g., click-through likelihood, recall score, likability) and a qualitative rationale. Aggregating these opinions produces a distribution of preferences that more closely mirrors real human crowds. Across eleven behavioral prediction tasks, Social Agents outperforms single-LLM baselines by up to 67.45% on simple judgments (e.g. webpage likability) and 9.88% on complex interpretive reasoning (e.g. video memorability). Social Agents’ individual persona predictions also align with human judgments, reaching Pearson correlations up to 0.71. 
These results position computational crowd simulation as a scalable, interpretable tool for improving behavioral prediction and supporting societal decision making.
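The aggregation step described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the Social Agents pipeline: the function names and the example scores are hypothetical, and plain averaging stands in for whatever aggregation the authors actually use. It relies on one fact that holds for any set of scores: the absolute error of the mean is never larger than the mean absolute error of the individual guesses.

```python
from statistics import mean, pstdev


def aggregate_persona_scores(scores):
    """Summarize independent persona scores as a mean and a spread."""
    return {"mean": mean(scores), "spread": pstdev(scores)}


def crowd_vs_individual_error(scores, truth):
    """Compare the error of the averaged guess with the average individual error.

    By the triangle inequality, crowd_error <= individual_error always holds.
    """
    crowd_error = abs(mean(scores) - truth)
    individual_error = mean(abs(score - truth) for score in scores)
    return crowd_error, individual_error


if __name__ == "__main__":
    # Hypothetical likability scores from four simulated personas.
    persona_scores = [90.0, 110.0, 95.0, 105.0]
    print(aggregate_persona_scores(persona_scores))
    print(crowd_vs_individual_error(persona_scores, truth=100.0))
```

In this toy case the individual errors cancel in the average, which is the mechanism the Wisdom of Crowds argument depends on; it works best when the guesses are diverse and not all biased in the same direction.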
Relatron: Automating Relational Machine Learning over Relational Databases
Zhikai Chen ⋅ Han Xie ⋅ Jian Zhang ⋅ Jiliang Tang ⋅ Xiang song ⋅ Huzefa Rangwala
Predictive modeling over relational databases (RDBs) powers applications in various domains, yet remains challenging due to the need to capture both cross-table dependencies and complex feature interactions. Recent Relational Deep Learning (RDL) methods automate feature engineering via message passing, while classical approaches like Deep Feature Synthesis (DFS) rely on predefined non-parametric aggregators. Despite promising performance gains, the comparative advantages of RDL over DFS and the design principles for selecting effective architectures remain poorly understood. We present a comprehensive study that unifies RDL and DFS in a shared design space and conducts large-scale architecture-centric searches across diverse RDB tasks. Our analysis yields three key findings: (1) RDL does not consistently outperform DFS, with performance being highly task-dependent; (2) no single architecture dominates across tasks, underscoring the need for task-aware model selection; and (3) validation accuracy is an unreliable guide for architecture choice. This search yields a curated model performance bank that links model architecture configurations to their performance; leveraging this bank, we analyze the drivers of the RDL–DFS performance gap and introduce two task signals—RDB task homophily and an affinity embedding that captures size, path, feature, and temporal structure—whose correlation with the gap enables principled routing. Guided by these signals, we propose Relatron, a task embedding-based meta-selector that first chooses between RDL and DFS and then prunes the within-family search to deliver strong performance. Lightweight loss-landscape metrics further guard against brittle checkpoints by preferring flatter optima. 
In experiments, Relatron resolves the “more tuning, worse performance” effect and, in joint hyperparameter–architecture optimization, achieves up to 18.5% improvement over strong baselines with $10\times$ lower computational cost than Fisher information–based alternatives. Our code is available at https://github.com/amazon-science/Automating-Relational-Machine-Learning.
ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
Yaswanth Narsupalli ⋅ Abhranil Chandra ⋅ Sreevatsa Muppirala ⋅ Manish Gupta ⋅ Pawan Goyal
Assessing the quality of generative model outputs from large language models (LLMs) or vision-language models (VLMs) poses significant challenges. Traditional evaluation methods rely either on human assessment, which is resource-intensive and not scalable, or on automatic metrics that often correlate poorly with human preferences. Another approach is to train dedicated neural evaluators, but this typically requires substantial training data and compute. In this study, we thus introduce ReFeR, a tuning-free framework for evaluating generative outputs, including both text and images, using a two-level hierarchy of pre-trained LLM and VLM evaluators. This multi-agent hierarchical strategy leverages additional compute at inference time by orchestrating multiple models and utilizing the increased test-time reasoning to boost performance. By having models themselves provide feedback and final judgments, ReFeR reduces the dependence on human evaluation. We rigorously evaluate ReFeR on four diverse evaluation benchmarks, where it surpasses prior methods in accuracy while also generating constructive feedback useful for downstream distillation and self-improvement via finetuning. Interestingly, ReFeR is also applicable to reasoning tasks: experiments on four reasoning benchmarks show ReFeR’s superior collective reasoning abilities. We present two variants of the framework: ReFeR-Turbo, optimized for accelerated performance, and ReFeR-Lite, offering a more test-time compute efficient solution. ReFeR-Lite is $\sim12-14\times$ more compute efficient than previous works while being comparably accurate to ReFeR-Turbo.
GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs
Shixian Luo ⋅ Zhu zezhou ⋅ Yu Yuan ⋅ Yuncheng Yang ⋅ Lianlei Shan ⋅ Yong Wu
Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the Program-to-Geometry task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present GeoGramBench, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50% accuracy at the highest abstraction level. By systematically analyzing model behaviors, our study exposes key limitations in program-driven spatial reasoning and positions GeoGramBench as an important resource for benchmarking and advancing behavioral research in symbolic-to-spatial geometric reasoning.
PepTri: Tri-Guided All-Atom Diffusion for Peptide Design via Physics, Evolution, and Mutual Information
Ngoc-Quang Nguyen ⋅ Jaeyoon Jung ⋅ Seijung Kim ⋅ Sunkyu Kim ⋅ Jaewoo Kang
Peptides, short chains of amino acids capable of high-specificity protein binding, represent a powerful class of therapeutics. While deep generative models have shown promise for peptide design, existing approaches are often structure-centric and therefore generate sequences and structures in a decoupled manner, failing to ensure that designs are simultaneously physically stable, evolutionarily plausible, and internally coherent. To overcome this limitation, we introduce PepTri, a novel diffusion framework that jointly generates peptide sequences and 3D structures within a unified, SE(3)-equivariant latent space. Our proposed model integrates three complementary guidance signals during the generative process: (i) physics-informed guidance via differentiable molecular mechanics to ensure structural stability and realism; (ii) evolutionary guidance to bias sequences toward conserved, functional motifs; and (iii) mutual information guidance to explicitly maximize sequence-structure coherence. This tri-guided approach ensures the generative process is steered by biophysical laws, biological priors, and information-theoretic alignment in tandem. Extensive evaluations on challenging peptide-protein design benchmarks, cross-domain (PepBench, LNR) and in-domain (PepBDB), demonstrate that PepTri substantially outperforms strong baselines, achieving state-of-the-art results in binding affinity, structural accuracy, and design diversity. Our results establish that integrating these complementary signals directly into the denoising process is crucial for generating viable, high-quality peptide medicines. PepTri is available at: https://github.com/aigensciences/PepTri
Scaling Laws and Symmetry, Evidence from Neural Force Fields
Nhat Khang Ngo ⋅ Siamak Ravanbakhsh
We present an empirical study on the geometric task of learning interatomic potentials, which shows that equivariance matters even more at larger scales; we show a clear power-law scaling behaviour with respect to data, parameters and compute with “architecture-dependent exponents”. In particular, we observe that equivariant architectures, which leverage task symmetry, scale better than non-equivariant models. Moreover, among equivariant architectures, higher-order representations translate to better scaling exponents. Our analysis also suggests that for compute-optimal training, the data and model sizes should scale in tandem regardless of the architecture. At a high level, these results suggest that, contrary to common belief, we should not leave it to the model to discover fundamental inductive biases such as symmetry, especially as we scale, because they change the inherent difficulty of the task and its scaling laws.
PoseX: AI Defeats Physics-based Methods on Protein Ligand Cross-Docking
Yize Jiang ⋅ Xinze Li ⋅ Yuanyuan Zhang ⋅ Jin Han ⋅ Youjun Xu ⋅ Ayush Pandit ⋅ ZAIXI ZHANG ⋅ Mengdi Wang ⋅ Mengyang Wang ⋅ Chong Liu ⋅ Guang Yang ⋅ Yejin Choi ⋅ Yingzhou Lu ⋅ Wu-Jun Li ⋅ Tianfan Fu ⋅ Fang Wu ⋅ Junhong Liu
Recently, significant progress has been made in protein-ligand docking, especially in deep learning methods, and some benchmarks were proposed, such as PoseBench and PLINDER. However, these studies typically focus on the self-docking scenario, which is less practical in real-world applications. Moreover, some studies employ complex frameworks that require extensive training, posing challenges for convenient and efficient assessment of docking methods. To address these gaps, we introduce PoseX, an open-source benchmark for evaluating both self-docking and cross-docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, we curated a novel dataset comprising 718 entries for self-docking and 1,312 entries for cross-docking; secondly, we incorporated 23 docking methods in three methodological categories, including physics-based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co-folding methods (e.g., AlphaFold3); thirdly, we developed a relaxation method for post-processing to minimize conformational energy and refine binding poses; fourthly, we established a public leaderboard to rank submitted models in real-time. We derived some key insights and conclusions through extensive experiments: (1) AI-based approaches consistently outperform physics-based methods in overall docking success rate. (2) Most intra- and intermolecular clashes of AI-based approaches can be greatly alleviated with relaxation, which means combining AI modeling with physics-based post-processing could achieve excellent performance. (3) AI co-folding methods exhibit ligand chirality issues, except for Boltz-1x, which introduced physics-inspired potentials to fix hallucinations, suggesting that stereochemical modeling greatly improves the structural plausibility of the predicted protein-ligand complexes. 
(4) Specifying binding pockets significantly promotes docking performance, indicating that pocket information can be leveraged adequately, particularly for AI co-folding methods, in future modeling efforts.
SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset
Pablo Lemos ⋅ Zane Beckwith ⋅ Sasaank Bandi ⋅ Maarten Van Damme ⋅ Jordan Crivelli-Decker ⋅ Benjamin Shields ⋅ Thomas Merth ⋅ Punit Jha ⋅ Nicola De Mitri ⋅ Tiffany Callahan ⋅ AJ Nish ⋅ Paul Abruzzo ⋅ Romelia Salomon-Ferrer ⋅ Martin Ganahl
Accurate prediction of protein-ligand binding affinities remains a cornerstone problem in drug discovery. While binding affinity is inherently dictated by the 3D structure and dynamics of protein-ligand complexes, current deep learning approaches are limited by the lack of high-quality experimental structures with annotated binding affinities. To address this limitation, we introduce the Structurally Augmented IC50 Repository (SAIR), the largest publicly available dataset of protein-ligand 3D structures with associated activity data. The dataset comprises $5,244,285$ structures across $1,048,857$ unique protein-ligand systems, curated from the ChEMBL and BindingDB databases, which were then computationally folded using the Boltz-1x model. We provide a comprehensive characterization of the dataset, including distributional statistics of proteins and ligands, and evaluate the structural fidelity of the folded complexes using PoseBusters. Our analysis reveals that approximately $3 \%$ of structures exhibit physical anomalies, predominantly related to internal energy violations. As an initial demonstration, we benchmark several binding affinity prediction methods, including empirical scoring functions (Vina, Vinardo), a 3D convolutional neural network (OnionNet-2), and a graph neural network (AEV-PLIG). While machine learning-based models consistently outperform traditional scoring function methods, neither exhibits a high correlation with ground truth affinities, highlighting the need for models specifically fine-tuned to synthetic structure distributions. This work provides a foundation for developing and evaluating next-generation structure and binding-affinity prediction models and offers insights into the structural and physical underpinnings of protein-ligand interactions. The link to the data will be added upon publication, to preserve anonymity of the submission.
FACET: A Fragment-Aware Conformer Ensemble Transformer
Duy Nguyen ⋅ Trung Nguyen ⋅ Hong Ha Le ⋅ Mai Truong ⋅ TrungTin Nguyen ⋅ Nhat Ho ⋅ Khoa Doan ⋅ Duy Duong-Tran ⋅ Li Shen ⋅ Daniel Sonntag ⋅ James Y Zou ⋅ Mathias Niepert ⋅ Hyojin Kim ⋅ Jonathan Allen
Accurately predicting molecular properties requires effective integration of structural information from both 2D molecular graphs and their corresponding equilibrium conformer ensembles. In this work, we propose FACET, a scalable Structure-Aware Graph Transformer that efficiently aggregates features from multiple 3D conformers while incorporating fragment-level information from 2D graphs. Unlike prior methods that rely on static geometric solvers or rigid fusion strategies, our approach utilizes a differentiable graph transformer to theoretically approximate the computationally expensive Fused Gromov–Wasserstein (FGW), enabling dynamic and scalable fusion of 2D and 3D structural information. We further enhance this mechanism by injecting fragment-specific structural priors into the attention layers, enabling the model to capture fine-grained molecular details. This unified design scales to large datasets, handling up to 75,000 molecules and hundreds of thousands of conformers, and provides over a 6× speedup compared to geometry-aware FGW-based baselines. Our method also achieves state-of-the-art results in molecular property prediction, Boltzmann-weighted ensemble modeling, and reaction-level tasks, and is particularly effective on chemically diverse compounds, including organocatalysts and transition-metal complexes.
Tracking Equivalent Mechanistic Interpretations Across Neural Networks
Alan Sun ⋅ Mariya Toneva
Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process, an interpretation, that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation, and generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop an algorithm to estimate interpretive equivalence and present a case study of its use on Transformer-based models. To analyze our algorithm, we introduce necessary and sufficient conditions for interpretive equivalence based on models' representation similarity. We provide guarantees that simultaneously relate a model's algorithmic interpretations, circuits, and representations. Our framework lays a foundation for the development of more rigorous evaluation methods of MI and automated, generalizable interpretation discovery methods.
Interpolation-Based Conditioning of Flow Matching Models for Bioisosteric Ligand Design
Yael Ziv ⋅ Martin Buttenschoen ⋅ Lukas Scheibelberger ⋅ Brian Marsden ⋅ Charlotte Deane
Fast, unconditional 3D generative models can now produce high-quality molecules, but adapting them for specific design tasks often requires costly retraining. To address this, we introduce Interpolate-Integrate and Replacement Guidance, two training-free, inference-time conditioning strategies that provide control over E(3)-equivariant flow-matching models. Our methods generate bioisosteric 3D molecules by conditioning on seed ligands or fragment sets to preserve key determinants like shape and pharmacophore patterns, without requiring the original fragment atoms to be present. We demonstrate their effectiveness on three drug-relevant tasks: natural product ligand hopping, bioisosteric fragment merging, and pharmacophore merging.
DCFold: Efficient Protein Structure Generation with Single Forward Pass
Zhe Zhang ⋅ Yuanning Feng ⋅ Yuxuan Song ⋅ Keyue Qiu ⋅ Hao Zhou ⋅ Wei-Ying Ma
AlphaFold3 introduces a diffusion-based architecture that elevates protein structure prediction to all-atom resolution with improved accuracy. This state-of-the-art performance has established AlphaFold3 as a foundation model for diverse generation and design tasks. However, its iterative design substantially increases inference time, limiting practical deployment in downstream settings such as virtual screening and protein design. We propose DCFold, a single-step generative model that attains AlphaFold3-level accuracy. Our Dual Consistency training framework, which incorporates a novel Temporal Geodesic Matching (TGM) scheduler, enables DCFold to achieve a 15× acceleration in inference while maintaining predictive fidelity. We validate its effectiveness across both structure prediction and binder design benchmarks.
Towards a Transferable Acceleration Method for Density Functional Theory
Zhe Liu ⋅ Yuyan Ni ⋅ Zhichen Pu ⋅ Qiming Sun ⋅ Siyuan Liu ⋅ Wen Yan
Recently, sophisticated deep learning-based approaches have been developed for generating efficient initial guesses to accelerate the convergence of density functional theory (DFT) calculations. While the actual initial guesses are often density matrices (DM), quantities that can be converted into density matrices also qualify as alternative forms of initial guesses. Hence, existing works mostly rely on the prediction of the Hamiltonian matrix for obtaining high-quality initial guesses. However, the Hamiltonian matrix is both numerically difficult to predict and intrinsically non-transferable, hindering the application of such models in real scenarios. In light of this, we propose a method that constructs DFT initial guesses by predicting the electron density in a compact auxiliary basis representation using E(3)-equivariant neural networks. Trained exclusively on small molecules with up to 20 atoms, our model achieves an average 33.3% reduction in SCF iterations for molecules three times larger (up to 60 atoms). This result is particularly significant given that baseline Hamiltonian-based methods fail to generalize, often increasing the iteration count by over 80% or failing to converge entirely on these larger systems. Furthermore, we demonstrate that this acceleration is robustly scalable: the model successfully accelerates calculations for systems with up to 900 atoms (polymers and polypeptides) without retraining. To the best of our knowledge, this work represents the first robust candidate for a universally transferable DFT acceleration method. We also release the SCFbench dataset and its accompanying code to facilitate future research in this promising direction.
Verifier-Constrained Flow Expansion for Discovery Beyond the Data
Riccardo De Santi ⋅ Kimon Protopapas ⋅ Ya-Ping Hsieh ⋅ Andreas Krause
Flow and diffusion models are typically pre-trained on limited available data (e.g., molecular samples), covering only a fraction of the valid design space (e.g., the full molecular space). As a consequence, they tend to generate samples from only a narrow portion of the feasible domain. This is a fundamental limitation for scientific discovery applications, where one typically aims to sample valid designs beyond the available data distribution. To this end, we address the challenge of leveraging access to a verifier (e.g., an atomic bonds checker) to adapt a pre-trained flow model so that its induced density expands beyond regions of high data availability, while preserving sample validity. We introduce formal notions of strong and weak verifiers and propose algorithmic frameworks for global and local flow expansion via probability-space optimization. Then, we present Flow Expander (FE), a scalable mirror descent scheme that provably tackles both problems by verifier-constrained entropy maximization over the flow process noised state space. Next, we provide a thorough theoretical analysis of the proposed method, and state convergence guarantees under both idealized and general assumptions. Finally, we empirically evaluate our method both on illustrative, visually interpretable settings and on a molecular design task, showcasing the ability of FE to expand a pre-trained flow model, increasing conformer diversity while preserving validity.
Greater than the Sum of Its Parts: Building Substructure into Protein Encoding Models
Robert Calef ⋅ Arthur Liang ⋅ Manolis Kellis ⋅ Marinka Zitnik
Protein representation learning has achieved major advances using large sequence and structure datasets, yet current models primarily operate at the level of individual residues or entire proteins. This overlooks a critical aspect of protein biology: proteins are composed of recurrent, evolutionarily conserved substructures that mediate core molecular functions. Despite decades of curated biological knowledge, these substructures remain largely unexploited in modern protein models. We introduce Magneton, an integrated environment for developing substructure-aware protein models. Magneton provides (1) a large-scale dataset of 530,601 proteins annotated with over 1.7 million substructures spanning 13,075 types, (2) a training framework for incorporating substructures into existing models, and (3) a benchmark suite of 13 tasks probing residue-, substructure-, and protein-level representations. Using Magneton, we develop substructure-tuning, a supervised fine-tuning method that distills substructural knowledge into pretrained protein models. Across state-of-the-art sequence- and structure-based models, substructure-tuning improves function-related tasks while revealing that substructural signals are complementary to global structural information. The Magneton environment, datasets, and substructure-tuned models are all openly available.
Protein Structure Tokenization via Geometric Byte Pair Encoding
Michael Sun ⋅ Weize Yuan ⋅ Gang Liu ⋅ Wojciech Matusik ⋅ Marinka Zitnik
Protein structure is central to biological function, and enabling multimodal protein models requires joint reasoning over sequence, structure, and function. A key barrier is the lack of principled protein structure tokenizers (PSTs): existing approaches fix token size or rely on continuous vector codebooks, limiting interpretability, multi-scale control, and transfer across architectures. We introduce GeoBPE, a geometry-grounded PST that transforms continuous, noisy, multi-scale backbone conformations into discrete "sentences" of geometry while enforcing global constraints. Analogous to byte-pair encoding, GeoBPE generates a hierarchical vocabulary of geometric primitives by iteratively (i) clustering Geo-Pair occurrences with k-medoids to yield a resolution-controllable vocabulary; (ii) quantizing each Geo-Pair to its closest medoid prototype; and (iii) reducing drift through differentiable inverse kinematics that optimizes boundary glue angles under an $\mathrm{SE}(3)$ end-frame loss. GeoBPE offers compression ($>10\times$ reduction in bits-per-residue at similar distortion rate), data efficiency ($>10\times$ less training data), and generalization (maintains a test/train distortion ratio of $1.0$ to $1.1$). It is architecture-agnostic: (a) its hierarchical vocabulary provides a strong inductive bias for coarsening residue-level embeddings from large PLMs into motif- and protein-level representations, consistently outperforming leading PSTs across $12$ tasks and $24$ test splits; (b) paired with a transformer, GeoBPE supports unconditional backbone generation via language modeling; and (c) tokens align with CATH functional families and support expert-interpretable case studies, offering functional meaning absent in prior PSTs.
Efficient Prediction of Large Protein Complexes via Subunit-Guided Hierarchical Refinement
Chixiang Lu ⋅ Yunhua Zhong ⋅ Shikang Liang ⋅ XIAOJUAN QI ⋅ Haibo Jiang
State-of-the-art protein structure predictors have revolutionized structural biology, yet quadratic memory growth with token length makes end-to-end inference impractical for large complexes beyond a few thousand tokens. We introduce HierAFold, a hierarchical pipeline that exploits the modularity of large complexes via PAE-guided (Predicted Aligned Error) subunit decomposition, targeted interface-aware refinement, and confidence-weighted assembly. PAE maps localize rigid intra-chain segments and sparse inter-chain interfaces, enabling joint refinement of likely interacting subunits to capture multi-body cooperativity without increasing memory. HierAFold matches AlphaFold3 accuracy and raises success rates from 49.9% (CombFold) to 73.1% on a recent PDB set. For large complexes, it cuts peak memory by $\sim$25 GB on a 4,000-token target ($\sim$40%), successfully models complexes with over $5{,}000$ tokens that are out-of-memory for AlphaFold3, and achieves a two-fold higher success rate than CombFold.
Evaluating Machine Learned Inter-Atomic Potentials for a Practical Simulation Workflow
Richard Strunk ⋅ Karnik Ram ⋅ Daniel Cremers
MLIPs are a promising paradigm in atomistic simulation, potentially offering the accuracy of ab-initio methods at the speed of empirical potentials. In this blog post, we give an overview of recent MLIP architectures, followed by an evaluation on a practical CO2 adsorption simulation. We find that as of today these models, though promising, are far from plug-and-play, requiring significant engineering effort to operate within established simulation frameworks, while also failing to produce physically consistent results.
Training Dynamics of Learning 3D-Rotational Equivariance
Max W. Shen ⋅ Ewa Nowara ⋅ Michael Maser ⋅ Kyunghyun Cho
While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation on 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models reduce equivariance error quickly to $\leq$2% of held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant "efficiency gap" is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.
Zephyrus: An Agentic Framework for Weather Science
Sumanth Varambally ⋅ Marshall Fisher ⋅ Jas Thakker ⋅ Yiwei Chen ⋅ Zhirui Xia ⋅ Yasaman Jafari ⋅ Ruijia Niu ⋅ Manas Jain ⋅ Veeramakali Vignesh Manivannan ⋅ Zachary Novack ⋅ Luyu Han ⋅ Srikar Eranky ⋅ Salva Rühling Cachay ⋅ Taylor Berg-Kirkpatrick ⋅ Duncan Watson-Parris ⋅ Yian Ma ⋅ Rose Yu
Foundation models for weather science are pre-trained on vast amounts of structured numerical data and outperform traditional weather forecasting systems. However, these models lack language-based reasoning capabilities, limiting their utility in interactive scientific workflows. Large language models (LLMs) excel at understanding and generating text but cannot reason about high-dimensional meteorological datasets. We bridge this gap by building the first agentic framework for weather science. Our framework includes a Python code-based environment for agents (ZephyrusWorld) to interact with weather data, featuring tools including a WeatherBench 2 dataset indexer, geolocator for geocoding from natural language, weather forecasting, climate simulation capabilities, and a climatology module for querying precomputed climatological statistics (e.g., means, extremes, and quantiles) across multiple timescales. We design Zephyrus, a multi-turn LLM-based weather agent that iteratively analyzes weather datasets, observes results, and refines its approach through conversational feedback loops. We accompany the agent with a new benchmark, ZephyrusBench, with a scalable data generation pipeline that constructs diverse question-answer pairs across weather-related tasks, from basic lookups to advanced forecasting, extreme event detection, and counterfactual reasoning. Experiments on this benchmark demonstrate the strong performance of Zephyrus agents over text-only baselines, outperforming them by up to 44 percentage points in correctness. However, the hard tasks are still difficult even with frontier LLMs, highlighting the challenging nature of our benchmark and suggesting room for future development. Our codebase and benchmark are available at https://github.com/Rose-STL-Lab/Zephyrus.
High Probability Bounds for Non-Convex Stochastic Optimization with Momentum
Shaojie Li ⋅ Pengwei Tang ⋅ Bowei Zhu ⋅ Yong Liu
Stochastic gradient descent with momentum (SGDM) is widely used in machine learning, yet high-probability learning bounds for SGDM in non-convex settings remain scarce. In this paper, we provide high-probability convergence bounds and generalization bounds for SGDM. First, we establish such bounds for the gradient norm in the general non-convex case. The resulting convergence bounds are tighter than existing theoretical results, and the obtained generalization bounds seem to be the first for SGDM. Next, under the Polyak-Łojasiewicz condition, we derive bounds for the function-value error instead of the gradient norm, and the corresponding learning rates are faster than in the general non-convex case. Finally, by additionally assuming a mild Bernstein condition on the gradient, we obtain even sharper generalization bounds whose learning rates can reach $\widetilde{\mathcal{O}}(1/n^2)$ in the low-noise regime, where $n$ is the sample size. Overall, we provide a systematic study of high-probability learning bounds for non-convex SGDM.
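As a rough illustration of the algorithm the abstract analyzes, the following is a minimal sketch of heavy-ball SGD with momentum on a toy objective. The function name `sgdm`, the hyperparameters, and the quadratic objective are illustrative assumptions, not taken from the paper; the abstract does not specify which momentum variant is studied. The toy objective $F(w) = \tfrac{1}{2}\|w\|^2$ satisfies the Polyak-Łojasiewicz condition $\tfrac{1}{2}\|\nabla F(w)\|^2 \ge \mu\,(F(w) - F^*)$ with $\mu = 1$.

```python
import numpy as np

def sgdm(grad_fn, w0, lr=0.1, beta=0.9, steps=200, noise_std=0.0, seed=0):
    """Heavy-ball SGD with momentum (one common variant, assumed here):
    m <- beta * m + g,  w <- w - lr * m."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)
    for _ in range(steps):
        # Stochastic gradient: exact gradient plus optional Gaussian noise.
        g = grad_fn(w) + noise_std * rng.standard_normal(w.shape)
        m = beta * m + g
        w = w - lr * m
    return w

def toy_grad(w):
    """Gradient of the toy objective F(w) = 0.5 * ||w||^2."""
    return w

w_final = sgdm(toy_grad, w0=[3.0, -2.0], noise_std=0.0)
grad_norm = float(np.linalg.norm(toy_grad(w_final)))
print(f"final gradient norm: {grad_norm:.2e}")
```

Setting `noise_std` above zero gives a noisy run whose final gradient norm fluctuates from seed to seed, which is the kind of quantity a high-probability bound controls.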
FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching
Bernardo Perrone Ribeiro ⋅ Jana Faganeli Pucer
Radar-based precipitation nowcasting, the task of forecasting short-term precipitation fields from previous radar images, is a critical problem for flood risk management and decision-making. While deep learning has substantially advanced this field, two challenges remain fundamental: the uncertainty of atmospheric dynamics and the efficient modeling of high-dimensional data. Diffusion models have shown strong promise by producing sharp, reliable forecasts, but their iterative sampling process is computationally prohibitive for time-critical applications. We introduce FlowCast, the first end-to-end probabilistic model leveraging Conditional Flow Matching (CFM) as a direct noise-to-data generative framework for precipitation nowcasting. Unlike hybrid approaches, FlowCast learns a direct noise-to-data mapping in a compressed latent space, enabling rapid, high-fidelity sample generation. Our experiments demonstrate that FlowCast establishes a new state-of-the-art in probabilistic performance while also exceeding deterministic baselines in predictive accuracy. A direct comparison further reveals the CFM objective is both more accurate and significantly more efficient than a diffusion objective on the same architecture, maintaining high performance with significantly fewer sampling steps. This work positions CFM as a powerful and practical alternative for high-dimensional spatiotemporal forecasting.
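The direct noise-to-data mapping can be illustrated with a minimal sketch of a conditional flow matching training pair. It assumes the common linear interpolation path between a noise sample and a data sample; the abstract does not state which probability path FlowCast uses, and the names `cfm_training_pair` and `euler_sample` are hypothetical.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Linear-path conditional flow matching sample (assumed path).
    x0: noise sample, x1: data sample, t in [0, 1].
    Returns the interpolated point x_t and the regression target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0  # constant along the straight path
    return x_t, target_velocity

def euler_sample(velocity_fn, x0, n_steps=4):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)  # stands in for a latent noise vector
x1 = rng.standard_normal(8)  # stands in for a latent data vector
x_t, u = cfm_training_pair(x0, x1, t=0.3)

# With the exact conditional velocity, a handful of Euler steps reaches x1.
x_end = euler_sample(lambda x, t: x1 - x0, x0, n_steps=4)
```

In practice a neural network would be trained to regress `target_velocity` from `x_t`, `t`, and the conditioning radar frames. The straight path is one reason few sampling steps can suffice, though the learned field is only an approximation of it.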
HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data
Hiren Madhu ⋅ João Felipe Rocha ⋅ Tinglin Huang ⋅ Siddharth Viswanath ⋅ Smita Krishnaswamy ⋅ Rex Ying
Single-cell transcriptomics and proteomics have become a great source for data-driven insights into biology, enabling the use of advanced deep learning methods to understand cellular heterogeneity and gene expression at the single-cell level. With the advent of spatial-omics data, we have the promise of characterizing cells within their tissue context as it provides both spatial coordinates and intra-cellular transcriptional or protein counts. Beyond transcriptomics, proteomics offers a complementary view by directly measuring proteins, which are the primary effectors of cellular function and key therapeutic targets. However, existing models either ignore the spatial information or the complex genetic and proteomic programs within cells. Thus, they cannot infer how cell internal regulation adapts to microenvironmental cues. Furthermore, these models often utilize fixed gene vocabularies, hindering their generalizability to datasets with different genes than pretraining. In this paper, we introduce HEIST, a hierarchical graph transformer foundation model for spatial transcriptomics and proteomics. HEIST models tissues as hierarchical graphs. The higher level graph is a spatial cell graph, and each cell, in turn, is represented by its lower level gene co-expression network graph. Rather than using a fixed gene vocabulary, HEIST computes gene embeddings from its co-expression network and cellular context. HEIST achieves this by performing both intra-level and cross-level message passing to utilize the hierarchy in its embeddings and can thus generalize to novel datatypes including spatial proteomics without retraining. HEIST is pretrained on 22.3M cells from 124 tissues across 15 organs using spatially-aware contrastive and masked autoencoding objectives. Unsupervised analysis of HEIST embeddings reveals spatially informed subpopulations missed by prior models. Downstream evaluations demonstrate generalizability to proteomics data and state-of-the-art performance in clinical outcome prediction, cell type annotation, and gene imputation across multiple technologies.
SigmaDock: Untwisting Molecular Docking with Fragment-Based SE(3) Diffusion
Alvaro Prat ⋅ Leo Zhang ⋅ Charlotte Deane ⋅ Yee Whye Teh ⋅ Garrett Morris
Determining the binding pose of a ligand to a protein, known as molecular docking, is a fundamental task in drug discovery. Generative approaches promise faster, improved, and more diverse pose sampling than physics-based methods, but are often hindered by chemically implausible outputs, poor generalisability, and high computational cost. To address these challenges, we introduce a novel fragmentation scheme, leveraging inductive biases from structural chemistry, to decompose ligands into rigid-body fragments. Building on this decomposition, we present SigmaDock, an SE(3) Riemannian diffusion model that generates poses by learning to reassemble these rigid bodies within the binding pocket. By operating at the level of fragments in SE(3), SigmaDock exploits well-established geometric priors while avoiding overly complex diffusion processes and unstable training dynamics. Experimentally, we show SigmaDock achieves state-of-the-art performance, reaching Top-1 success rates (RMSD $<2$ & PB-valid) above 79.9% on the PoseBusters set, compared to 12.7-32.8% reported by recent deep learning approaches, whilst demonstrating consistent generalisation to unseen proteins. SigmaDock is the first deep learning approach to surpass classical physics-based docking under the PB train-test split, marking a significant leap forward in the reliability and feasibility of deep learning for molecular modelling.
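To make the rigid-body view and the success criterion concrete, here is a minimal sketch of applying an SE(3) element to a fragment and checking a pose against an RMSD threshold. The function names, the toy coordinates, and the angstrom unit are assumptions for illustration. The threshold of 2 comes from the abstract, and the PoseBusters validity check is represented only by a boolean flag rather than a real call to that tool.

```python
import numpy as np

def apply_rigid_transform(coords, rotation, translation):
    """Apply an SE(3) element (3x3 rotation, 3-vector translation)
    to an (N, 3) array of fragment atom coordinates."""
    return coords @ rotation.T + translation

def rmsd(pred, ref):
    """Root-mean-square deviation between matched (N, 3) coordinate sets,
    computed in place without superposition."""
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

def is_success(pred, ref, pb_valid, threshold=2.0):
    """Pose counts as a success if RMSD is below the threshold (assumed
    to be in angstroms) and the externally supplied validity flag is True."""
    return rmsd(pred, ref) < threshold and pb_valid

fragment = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]])
rot_z = np.array([[0.0, -1.0, 0.0],   # 90-degree rotation about the z axis
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
shift = np.array([0.5, 0.0, 0.0])
moved = apply_rigid_transform(fragment, rot_z, shift)
```

Because the transform is rigid, bond lengths and angles inside the fragment are unchanged, so only the fragment's placement has to be learned.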
RIDER: 3D RNA Inverse Design with Reinforcement Learning-Guided Diffusion
Tianmeng Hu ⋅ Yongzheng Cui ⋅ Biao Luo ⋅ Ke Li
The inverse design of RNA three-dimensional (3D) structures is crucial for engineering functional RNAs in synthetic biology and therapeutics. While recent deep learning approaches have advanced this field, they are typically optimized and evaluated using native sequence recovery, which is a limited surrogate for structural fidelity, since different sequences can fold into similar 3D structures and high recovery does not necessarily indicate correct folding. To address this limitation, we propose RIDER, an RNA Inverse DEsign framework with Reinforcement learning that directly optimizes for 3D structural similarity. First, we develop and pre-train a GNN-based generative diffusion model conditioned on the target 3D structure, achieving a 9% improvement in native sequence recovery over state-of-the-art methods. Then, we fine-tune the model with an improved policy gradient algorithm using four task-specific reward functions based on 3D self-consistency metrics. Experimental results show that RIDER improves structural similarity by over 100% across all metrics and discovers designs that are distinct from native sequences.
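The limitation of native sequence recovery can be seen in a toy example. The sketch below computes recovery as the fraction of matching positions and compares two hypothetical hairpin-like sequences whose stems are swapped but remain Watson-Crick complementary. The sequences, the pairing list, and the helper names are illustrative assumptions, and the pairing check is a crude stand-in for the 3D self-consistency metrics the abstract mentions.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed base equals the native base."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(d == n for d, n in zip(designed, native)) / len(native)

WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}

def stem_is_paired(seq: str, pairs) -> bool:
    """Check that every listed (i, j) position pair is Watson-Crick complementary."""
    return all((seq[i], seq[j]) in WATSON_CRICK for i, j in pairs)

native = "GGGAAACCC"    # toy hairpin: 3-base stem, AAA loop
designed = "CCCAAAGGG"  # stem bases swapped, loop unchanged
stem = [(0, 8), (1, 7), (2, 6)]

recovery = sequence_recovery(designed, native)  # only the 3 loop bases match
both_paired = stem_is_paired(native, stem) and stem_is_paired(designed, stem)
```

Here the designed sequence keeps every stem pair intact yet recovers only a third of the native bases, which is why a structure-based reward can be more informative than recovery alone.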
CDBridge: A Cross-omics Post-training Bridge Strategy for Context-aware Biological Modeling
Chang Yu ⋅ Siyuan Li ⋅ Zicheng Liu ⋅ Jingbo Zhou ⋅ Xianglong Guo ⋅ Kai Yu ⋅ Yuqing Zhou ⋅ Ken Li ⋅ Zelin Zang ⋅ Zhen Lei ⋅ Stan Z Li
Linking genomic DNA to quantitative, context-specific expression remains a central challenge in computational biology. Current foundation models capture either tissue context or sequence features, but not both. Cross-omics systems, in turn, often overlook critical mechanisms such as alternative splicing and isoform reuse. We present CDBridge, a post-training strategy that unifies pretrained DNA and protein models into a context-aware framework without full retraining. CDBridge operates in two stages: (a) Seq-context learning, where a splicing-inspired token merge compresses long genomic regions into isoform-aware representations, and (b) Env-context learning, where a conditional decoder injects tissue embeddings to model expression under diverse biological contexts. To benchmark this setting, we introduce GTEx-Benchmark, derived from GTEx and Ensembl, which requires models to capture long-range exon dependencies, resolve isoform reuse, and predict tissue-specific expression levels. Across qualitative and quantitative tasks, CDBridge consistently outperforms prior methods that ignore central dogma constraints or context dependence, offering a scalable and biologically faithful solution for DNA-to-expression modeling.
GeneBreaker: Jailbreak Attacks against DNA Language Models with Pathogenicity Guidance
ZAIXI ZHANG ⋅ Zhenghong Zhou ⋅ Ruofan Jin ⋅ Le Cong ⋅ Mengdi Wang
DNA, encoding genetic instructions for almost all living organisms, fuels groundbreaking advances in genomics and synthetic biology. Recently, DNA Language Models have achieved success in designing synthetic functional DNA sequences, even whole genomes of novel bacteriophage, verified with wet lab experiments. Such remarkable generative power also brings severe biosafety concerns about whether DNA language models can design human viruses. With the goal of exposing vulnerabilities and informing the development of robust safeguarding techniques, we perform a systematic biosafety evaluation of DNA language models through the lens of jailbreak attacks. Specifically, we introduce JailbreakDNABench, a benchmark centered on high-priority human viruses, together with an end-to-end jailbreak framework, GeneBreaker. GeneBreaker integrates three key components: (1) an LLM agent equipped with customized bioinformatics tools to design high-homology yet non-pathogenic jailbreak prompts, (2) beam search guided by PathoLM and log-probability heuristics to steer sequence generation toward pathogen-like outputs, and (3) a BLAST- and function-annotation–based evaluation pipeline to identify successful jailbreaks. On JailbreakDNABench, GeneBreaker successfully jailbreaks the latest Evo series models across 6 viral categories consistently (up to 60% Attack Success Rate for Evo2-40B). Further case studies on SARS-CoV-2 spike protein and HIV-1 envelope protein demonstrate the sequence and structural fidelity of jailbreak output, while evolutionary modeling of SARS-CoV-2 underscores biosecurity risks. Our findings also reveal that scaling DNA language models amplifies dual-use risks, motivating enhanced safety alignment and tracing mechanisms.
CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models
Weining Fu ⋅ Kai Shu ⋅ Kui Xu ⋅ Qiangfeng Zhang
Cryo-electron microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-level visualization of biomolecular assemblies. However, the exponential growth in cryo-EM data throughput and complexity, coupled with diverse downstream analytical tasks, necessitates unified computational frameworks that transcend current task-specific deep learning approaches with limited scalability and generalizability. We present CryoLVM, a foundation model that learns rich structural representations from experimental density maps with resolved structures by leveraging the Joint-Embedding Predictive Architecture (JEPA) integrated with a SCUNet-based backbone, enabling rapid adaptation to various downstream tasks. We further introduce a novel histogram-based distribution alignment loss that accelerates convergence and enhances fine-tuning performance. Through comprehensive evaluation across three critical cryo-EM tasks—density map sharpening, density map super-resolution, and missing wedge restoration—CryoLVM consistently outperforms state-of-the-art baselines across multiple density map quality metrics, confirming its potential as a versatile foundation model for a wide spectrum of cryo-EM applications.
CP-Agent: Context‑Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations
Yuxin Zhang ⋅ Yiyao Li ⋅ Ping Ho ⋅ Simon See ⋅ Zhenqin Wu ⋅ Kevin Tsia
Cell Painting combines multiplexed fluorescent staining, high‑content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug–disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP‑Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent’s potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening and by streamlining iterative cycles of hypothesis generation.
SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
Jiahao Zhao ⋅ Feng Jiang ⋅ Shaowei Qin ⋅ Zhonghui Zhang ⋅ Junhao Liu ⋅ Guibing Guo ⋅ Hamid Alinejad-Rokny ⋅ Min Yang
Large language models (LLMs) are increasingly applied in scientific research, offering new capabilities for knowledge discovery and reasoning. In single-cell biology, however, evaluation practices for both general and specialized LLMs remain inadequate: existing benchmarks are fragmented across tasks, adopt formats such as multiple-choice classification that diverge from real-world usage, and rely on metrics lacking interpretability and biological grounding. We present SC-ARENA, a natural language evaluation framework tailored to single-cell foundation models. SC-ARENA formalizes a virtual cell abstraction that unifies evaluation targets by representing both intrinsic attributes and gene-level interactions. Within this paradigm, we define five natural language tasks — cell type annotation, captioning, generation, perturbation prediction, and scientific QA — that probe core reasoning capabilities in cellular biology. To overcome the limitations of brittle string-matching metrics, we introduce knowledge-augmented evaluation, which incorporates external ontologies, marker databases, and scientific literature to support biologically faithful and interpretable judgments. Experiments and analysis across both general-purpose and domain-specialized LLMs demonstrate that: (i) under the Virtual Cell unified evaluation paradigm, current models achieve uneven performance on biologically complex tasks, particularly those demanding mechanistic or causal understanding; and (ii) our knowledge-augmented evaluation framework ensures biological correctness, provides interpretable, evidence-grounded rationales, and achieves high discriminative capacity, overcoming the brittleness and opacity of conventional metrics. SC-ARENA thus provides a unified and interpretable framework for assessing LLMs in single-cell biology, pointing toward the development of biology-aligned, generalizable foundation models.
WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport
Qiangwei Peng ⋅ Zihan Wang ⋅ Junda Ying ⋅ Yuhao Sun ⋅ Qing Nie ⋅ Lei Zhang ⋅ Tiejun Li ⋅ Peijie Zhou
The Wasserstein–Fisher–Rao (WFR) metric extends dynamic optimal transport (OT) by coupling displacement with change of mass, providing a principled geometry for modeling unbalanced snapshot dynamics. Existing WFR solvers, however, are often unstable, computationally expensive, and difficult to scale. Here we introduce WFR Flow Matching (WFR-FM), a simulation-free training algorithm that unifies flow matching with dynamic unbalanced OT. Unlike classical flow matching which regresses only a transport vector field, WFR-FM simultaneously regresses a vector field for displacement and a scalar growth rate function for birth–death dynamics, yielding continuous flows under the WFR geometry. Theoretically, we show that minimizing the WFR-FM loss exactly recovers WFR geodesics. Empirically, WFR-FM yields more accurate and robust trajectory inference in single-cell biology, reconstructing consistent dynamics with proliferation and apoptosis, estimating time-varying growth fields, and applying to generative dynamics under imbalanced data. It outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy. Overall, WFR-FM establishes a unified and efficient paradigm for learning dynamical systems from unbalanced snapshots, where not only states but also mass evolve over time. The Python code is available at .
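For orientation, the coupling of displacement and mass change can be written in the standard dynamic form of the WFR problem. The block below is a sketch of that textbook formulation, not the paper's loss. The symbols $\rho_t$ (density), $v_t$ (velocity field), $g_t$ (growth rate), and the weight $\alpha > 0$ are notation assumed here, and normalization constants vary between references.

```latex
% Dynamic (Benamou-Brenier-style) formulation of the WFR distance, up to
% convention-dependent constants. Mass moves with velocity v_t and is
% created or destroyed at rate g_t.
\[
  \mathrm{WFR}_{\alpha}^{2}(\rho_0, \rho_1)
  = \inf_{(\rho_t,\, v_t,\, g_t)}
    \int_0^1 \!\! \int
      \bigl( \lVert v_t(x) \rVert^{2} + \alpha\, g_t(x)^{2} \bigr)\,
      \rho_t(x)\, \mathrm{d}x\, \mathrm{d}t
\]
\[
  \text{subject to}\quad
  \partial_t \rho_t + \nabla \cdot (\rho_t\, v_t) = g_t\, \rho_t,
  \qquad \rho_{t=0} = \rho_0,\quad \rho_{t=1} = \rho_1 .
\]
```

Setting $g_t \equiv 0$ recovers the balanced dynamic OT problem, which is the case classical flow matching addresses by regressing $v_t$ alone.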
TusoAI: Agentic Optimization for Scientific Methods
Alistair Turcan ⋅ Kexin Huang ⋅ Lei Li ⋅ Martin J. Zhang
Scientific discovery is often slowed by the manual development of computational tools needed to analyze complex experimental data. Building such tools is costly and time-consuming because scientists must iteratively review literature, test modeling and scientific assumptions against empirical data, and implement these insights into efficient software. Large language models (LLMs) have demonstrated strong capabilities in synthesizing literature, reasoning with empirical data, and generating domain-specific code, offering new opportunities to accelerate computational method development. Existing LLM-based systems either focus on performing scientific analyses using existing computational methods or on developing computational methods or models for general machine learning without effectively integrating the often unstructured knowledge specific to scientific domains. Here, we introduce TusoAI, an agentic AI system that takes a scientific task description with an evaluation function and autonomously develops and optimizes computational methods for the application. TusoAI integrates domain knowledge into a knowledge tree representation and performs iterative, domain-specific optimization and model diagnosis, improving performance over a pool of candidate solutions. We conducted comprehensive benchmark evaluations demonstrating that TusoAI outperforms state-of-the-art expert methods, MLE agents, and scientific AI agents across diverse tasks. Applying TusoAI to two key open problems in genetics improved existing computational methods and uncovered new biology missed by previous methods. Our code is publicly available at https://github.com/Alistair-Turcan/TusoAI.
Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang ⋅ Heng Guo ⋅ Zheng Jiang ⋅ Chunming He ⋅ Xiu Li ⋅ Minfeng Xu
Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.
Can Large Language Models Match the Conclusions of Systematic Reviews?
Christopher Polzak ⋅ Alejandro Lozano ⋅ Min Sun ⋅ James Burgess ⋅ Yuhui Zhang ⋅ Kevin Wu ⋅ Chia-Chun Chiang ⋅ Jeffrey J. Nirschl ⋅ Serena Yeung
Systematic reviews (SRs), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 medical SRs with the studies they are based on. We benchmark 25 LLMs on MedEvidence, including reasoning, non-reasoning, and medical specialist models across varying sizes (from 7B to 700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians.
Moving Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
Max Lamparth ⋅ Declan Grabb ⋅ Amy Franks ⋅ Scott Gershan ⋅ Kaitlyn Kunstman ⋅ Aaron Lulla ⋅ Monika Roots ⋅ Manu Sharma ⋅ Aryan Shrivastava ⋅ Nina Vasan ⋅ Colleen Waickman
Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. In psychiatry especially, these challenges are worsened by fairness and bias issues, since models can be swayed by patient demographics even when those factors should not influence clinical decisions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This U.S. centric dataset — created without any LM assistance — is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all base questions with five answer options each have had the decision-irrelevant demographic patient information removed and replaced with variables, e.g., for age or ethnicity, and are available for male, female, or non-binary-coded patients. This design enables systematic evaluations of model performance and bias by studying how demographic factors affect decision-making. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating sixteen off-the-shelf and six (mental) health fine-tuned LMs on category-specific task accuracy, on the fairness impact of patient demographic information on decision-making, and how consistently free-form responses deviate from human-annotated samples.
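The variable-based design described above can be sketched as a small templating step followed by a consistency check. Everything in the block is hypothetical: the template wording, the variable values, the helper names, and the stand-in model functions are illustrative and do not come from the dataset.

```python
from itertools import product

def expand_variants(template, variables):
    """Fill a question template with every combination of demographic values.
    Returns a list of (assignment dict, filled question text) pairs."""
    keys = sorted(variables)
    variants = []
    for values in product(*(variables[k] for k in keys)):
        assignment = dict(zip(keys, values))
        variants.append((assignment, template.format(**assignment)))
    return variants

def answer_flip_rate(variants, model_fn):
    """Share of variants whose answer differs from the most common answer.
    0.0 means the model's choice is unaffected by the demographic variables."""
    answers = [model_fn(text) for _, text in variants]
    majority = max(set(answers), key=answers.count)
    return sum(a != majority for a in answers) / len(answers)

# Hypothetical template and values, for illustration only.
template = "A {age}-year-old {gender} patient presents for follow-up. Best next step?"
variables = {"age": ["25", "70"], "gender": ["male", "female", "non-binary"]}
variants = expand_variants(template, variables)

def constant_model(text):
    """Stand-in model that always picks option A."""
    return "A"

def biased_model(text):
    """Stand-in model whose answer changes with one demographic value."""
    return "B" if "female" in text else "A"

stable_rate = answer_flip_rate(variants, constant_model)
biased_rate = answer_flip_rate(variants, biased_model)
```

A flip rate near zero on questions where demographics are decision-irrelevant is the desired behavior. Larger values flag cases where the filled-in variables changed the model's choice.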
Random Anchors with Low-rank Decorrelated Learning: A Minimalist Pipeline for Class-Incremental Medical Image Classification
Xinyao Wu ⋅ Zhe Xu ⋅ Raymond Kai-yu Tong
Class-incremental learning (CIL) in medical image-guided diagnosis requires models to preserve knowledge of historical disease classes while adapting to emerging categories. Pre-trained models (PTMs) with well-generalized features provide a strong foundation, yet most PTM-based CIL strategies, such as prompt tuning, task-specific adapters and model mixtures, rely on increasingly complex designs. While effective in general-domain benchmarks, these methods falter in medical imaging, where low intra-class variability and high inter-domain shifts (from scanners, protocols and institutions) make CIL particularly prone to representation collapse and domain misalignment. Under such conditions, we find that lightweight representation calibration strategies, often dismissed in general-domain CIL for their modest gains, can be remarkably effective for adapting PTMs in medical settings. To this end, we introduce Random Anchors with Low-rank Decorrelated Learning (RA-LDL), a minimalist representation-based framework that combines (a) PTM-based feature extraction with optional ViT-Adapter tuning, (b) feature calibration via frozen Random Anchor projection and a single-session-trained Low-Rank Projection (LRP), and (c) analytical closed-form decorrelated learning. The entire pipeline requires only one training session and minimal task-specific tuning, making it appealing for efficient deployment. Despite its simplicity, RA-LDL achieves consistent and substantial improvements across both general-domain and medical-specific PTMs, and outperforms recent state-of-the-art methods on four diverse medical imaging datasets. These results highlight that minimalist representation recalibration, rather than complex architectural modifications, can unlock the underexplored potential of PTMs in medical CIL. We hope this work establishes a practical and extensible foundation for future research in class-incremental image-guided diagnosis. Code will be made publicly available.
OpenPros: A Large-Scale Dataset for Limited View Prostate Ultrasound Computed Tomography
Hanchen Wang ⋅ Yixuan Wu ⋅ Yinan Feng ⋅ Peng Jin ⋅ Luoyuan Zhang ⋅ Shihang Feng ⋅ James Wiskin ⋅ Baris Turkbey ⋅ Peter Pinto ⋅ Bradford Wood ⋅ Songting Luo ⋅ Yinpeng Chen ⋅ Emad Boctor ⋅ Youzuo Lin
Prostate cancer is one of the most common and lethal cancers among men, making its early detection critically important. Ultrasound computed tomography (USCT) has emerged as an accessible and cost-effective method that reconstructs quantitative tissue parameters, which can serve as potential biomarkers for malignancy. However, current prostate USCT faces considerable barriers: limited-angle acquisitions due to anatomical constraints, tissue heterogeneity, proximity to organs and bony pelvic structures, and lengthy processing times. The lack of large-scale, anatomically precise datasets significantly hampers the development of high-quality, efficient, and generalizable methods. To address this gap, we introduce OpenPros, the first large-scale benchmark dataset for limited-angle prostate USCT, designed to evaluate machine learning algorithms for inverse problems systematically. Our dataset includes over 280,000 paired samples of realistic 2D speed-of-sound (SOS) phantoms and corresponding ultrasound full-waveform data, generated from anatomically accurate 3D digital prostate models derived from 4 real clinical MRI/CT scans and 62 ex vivo prostate specimens with experimental ultrasound measurements, annotated by medical experts. Simulations are conducted under clinically realistic configurations using advanced finite-difference time-domain (FDTD) and Runge-Kutta acoustic wave solvers, both provided as open-source components. Through comprehensive benchmarking, we find that deep learning methods significantly outperform traditional physics-based algorithms in inference efficiency and reconstruction accuracy. However, our results also reveal that current machine learning methods fail to deliver clinically acceptable, high-resolution reconstructions, underscoring critical gaps in generalization, robustness, and uncertainty quantification. By publicly releasing OpenPros, we provide the community with a rigorous benchmark that not only enables fair method comparison but also motivates new advances in physics-informed learning, foundation models for scientific imaging, and uncertainty-aware reconstruction, bridging the gap between academic ML research and real-world clinical deployment. The dataset is publicly accessible at https://open-pros.github.io/.
U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
Anjie Le ⋅ Henan Liu ⋅ Yue Wang ⋅ Zhenyu Liu ⋅ Rongkun Zhu ⋅ Taohan Weng ⋅ Jinze Yu ⋅ Boyang Wang ⋅ Yalun Wu ⋅ Kaiwen Yan ⋅ Quanlin Sun ⋅ Meirui Jiang ⋅ Jialun Pei ⋅ Siya Liu ⋅ Haoyun Zheng ⋅ Zhoujun Li ⋅ J. Alison Noble ⋅ Jacques Souquet ⋅ Xiaoqing Guo ⋅ Manxi Lin ⋅ Hongcheng Guo
Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with operators, noise, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 23 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.
KnowGuard: Knowledge-Driven Abstention for Multi-Round Clinical Reasoning
Xilin Dang ⋅ Kexin Chen ⋅ Xiaorui Su ⋅ Ayush Noori ⋅ Iñaki Arango ⋅ Lucas Vittor ⋅ LONG XINYI ⋅ Yuyang Du ⋅ Marinka Zitnik ⋅ Pheng-Ann Heng
In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with abstention, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidence. To address this, we propose KnowGuard, a novel investigate-before-abstain paradigm that integrates systematic knowledge graph exploration for clinical decision-making. Our approach consists of two key stages operating on a shared contextualized evidence pool: 1) an evidence discovery stage that systematically explores the medical knowledge space through graph expansion and direct retrieval, and 2) an evidence evaluation stage that ranks evidence using multiple factors to adapt exploration based on patient context and conversation history. This two-stage approach enables systematic knowledge graph exploration, allowing models to trace structured reasoning paths and recognize insufficient medical evidence. We evaluate our abstention approach using open-ended multi-round clinical benchmarks that mimic realistic diagnostic scenarios, assessing abstention quality through accuracy-efficiency trade-offs beyond existing closed-form evaluations. Experimental evidence clearly demonstrates that KnowGuard outperforms state-of-the-art abstention approaches, improving diagnostic accuracy by 3.93% through effective diagnostic interactions averaging 5.74 conversation turns.
AbdCTBench: Learning Clinical Biomarker Representations from Abdominal Surface Geometry
Muhammad Ahmed Chaudhry ⋅ Suhana Bedi ⋅ Pola Lydia Lagari ⋅ Brian Layden ⋅ William Galanter ⋅ Ayis Pyrros ⋅ Sanmi Koyejo
Body composition analysis through CT and MRI imaging provides critical insights for cardio-metabolic health assessment but remains limited by accessibility barriers including radiation exposure, high costs, and infrastructure requirements. We present AbdCTBench, a large-scale dataset containing 23,506 CT-derived abdominal surface meshes from 18,719 patients, paired with 87 comorbidity labels, 31 specific diagnosis codes, and 16 CT-derived biomarkers. Our key insight is that external surface geometry is predictive of internal tissue composition, enabling accessible health screening through consumer devices. We establish comprehensive benchmarks across seven computer vision architectures (ResNet-18/34/50, DenseNet-121, EfficientNet-B0, ViT-Small, Swin Transformer-Base), demonstrating that models can learn robust surface-to-biomarker representations directly from 2D mesh projections. Our best-performing models achieve clinically relevant accuracy: age prediction with MAE 6.22 years (R²=0.757), mortality prediction with AUROC 0.839, and diabetes (with chronic complications) detection with AUROC 0.801. Notably, smaller architectures consistently matched or surpassed larger models, while medical-domain pre-training (RadImageNet) and self-supervised pre-training (DINOv2) showed competitive but not superior performance. AbdCTBench represents the largest publicly available dataset bridging external body geometry with internal clinical measurements, enabling future research in accessible medical AI. We plan to release the dataset, evaluation protocols, and baseline models to accelerate research in representation learning for medical applications, immediately following the review period.
MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
Zheng Jiang ⋅ Heng Guo ⋅ Chengyu Fang ⋅ Changchen Xiao ⋅ Xinyang Hu ⋅ Lifeng Sun ⋅ Minfeng Xu
Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement. Without any human annotations for intermediate steps, MedVR achieves state-of-the-art performance on diverse public medical VQA benchmarks, significantly outperforming existing models. By learning to reason directly with visual evidence, MedVR promotes the robustness and transparency essential for accelerating the clinical deployment of medical AI.
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
Miao Jing ⋅ Mengting Jia ⋅ Junling Lin ⋅ Zhongxia Shen ⋅ Huan Gao ⋅ Mingkun Xu ⋅ Shangyang Li
Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
sleep2vec: Unified Cross-Modal Alignment for Heterogeneous Nocturnal Biosignals
Weixuan Yuan ⋅ Zengrui Jin ⋅ Yichen Wang ⋅ Donglin Xie ⋅ Ziyi Ye ⋅ Chao Zhang ⋅ Xuesong Chen
Tasks ranging from sleep staging to clinical diagnosis traditionally rely on standard polysomnography (PSG) devices, bedside monitors and wearable devices, which capture diverse nocturnal biosignals (e.g., EEG, EOG, ECG, SpO$_2$). However, heterogeneity across devices and frequent sensor dropout pose significant challenges for unified modelling of these multimodal signals. We present sleep2vec, a foundation model for diverse and incomplete nocturnal biosignals that learns a shared representation via cross-modal alignment. sleep2vec is contrastively pre-trained on 42,249 overnight recordings spanning nine modalities using a Demography, Age, Site & History-aware InfoNCE objective that incorporates physiological and acquisition metadata (e.g., age, gender, recording site) to dynamically weight negatives and mitigate cohort-specific shortcuts. On downstream sleep staging and clinical outcome assessment, sleep2vec consistently outperforms strong baselines and remains robust to any subset of available modalities and sensor dropout. We further characterize, to our knowledge for the first time, scaling laws for nocturnal biosignals with respect to modality diversity and model capacity. Together, these results show that unified cross-modal alignment, coupled with principled scaling, enables label-efficient, general-purpose modelling of real-world nocturnal biosignals.
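As a rough illustration of what weighting negatives in an InfoNCE objective can look like, the sketch below computes the loss for a single anchor with a per-negative weight. It is a minimal sketch under assumptions: the abstract does not give the actual objective, so the function name `weighted_info_nce`, the multiplicative form of the weights, and the temperature value are all hypothetical.

```python
import math

def weighted_info_nce(pos_sim, neg_sims, neg_weights, temperature=0.1):
    """InfoNCE loss for one anchor, with a weight on each negative.

    pos_sim     : similarity between the anchor and its positive
    neg_sims    : similarities between the anchor and each negative
    neg_weights : non-negative weights (assumed form); 1.0 for every
                  negative recovers the standard InfoNCE loss
    """
    pos = math.exp(pos_sim / temperature)
    neg = sum(w * math.exp(s / temperature)
              for s, w in zip(neg_sims, neg_weights))
    return -math.log(pos / (pos + neg))

# Hypothetical cosine similarities for one anchor.
POS_SIM = 0.8
NEG_SIMS = [0.6, 0.1]

uniform = weighted_info_nce(POS_SIM, NEG_SIMS, [1.0, 1.0])
# Down-weight the first negative, e.g. because metadata suggests it is
# an uninformative comparison (an assumed use of the weights).
reweighted = weighted_info_nce(POS_SIM, NEG_SIMS, [0.2, 1.0])
print(uniform, reweighted)
```

Setting a weight below 1 shrinks that negative's contribution to the denominator, so it pulls less on the learned representation.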
MedLesionVQA: A Multimodal Benchmark Emulating Clinical Visual Diagnosis for Body Surface Health
Deli Yu ⋅ Shengzhi Wang ⋅ Kai WU ⋅ Xiaozhong Ji ⋅ Bo Cui ⋅ Jieqiong Cao ⋅ Huichao Wang ⋅ Boyuan Jiang ⋅ Xu Wang ⋅ Qian Xu ⋅ ChaoGao ⋅ Yi Zhao ⋅ Dian Chen ⋅ Meng Li ⋅ Haifeng Wu ⋅ Yijun He ⋅ HaiHuaYang
Body-surface health conditions, spanning diverse clinical departments, represent some of the most frequent diagnostic scenarios and a primary target for medical multimodal large language models (MLLMs). Yet existing medical benchmarks are either built from publicly available sources with limited expert curation or focus narrowly on disease classification, failing to reflect the stepwise recognition and reasoning processes physicians follow in real practice. To address this gap, we introduce MedLesionVQA, the first benchmark explicitly designed to evaluate MLLMs on the visual diagnostic workflow for body-surface conditions at scale. All questions are derived from authentic clinical visual diagnosis scenarios and verified by medical experts with over 20 years of experience, while the data are drawn from 10k+ real patient visits, ensuring authenticity, clinical realism and diversity. MedLesionVQA consists of 12K in-house images (never publicly leaked) and 19K expert-verified question–answer pairs, with fine-grained annotations of 94 lesion types, 110 body regions, and 96 diseases. We evaluate 20+ state-of-the-art MLLMs against human physicians: the best model reaches 56.2% accuracy, far below primary physicians (61.4%) and senior specialists (73.2%). These results expose the persistent gap between MLLMs and clinical expertise, underscoring the need for multimodal benchmarks to drive trustworthy medical AI. The dataset is available at https://github.com/bytedance/MedLesionVQA.
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering
Yahan Li ⋅ Jifan Yao ⋅ John Bunyi ⋅ Adam Frank ⋅ Angel Hwang ⋅ Ruishan Liu
Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Expert evaluation of 1,080 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.
Critic–Adviser–Reviser Cyclic Refinement: Towards High-Quality EMR Corpus Generation with LLMs
Chen Ning ⋅ Xien Liu ⋅ Chenwei Yan ⋅ Xiao Zhang ⋅ Xinxin You ⋅ Yuxuan Zhou ⋅ xiangling fu ⋅ Ji Wu
Electronic medical records (EMRs) are vital for healthcare research, but their use is limited by privacy concerns. Synthetic EMR generation offers a promising alternative, yet most existing methods merely imitate real records without adhering to rigorous clinical quality principles. To address this, we introduce LLM-CARe, a stage-wise cyclic refinement framework that progressively improves EMR quality through three stages, each targeting a specific granularity: corpus, section and document. At each stage, a Critic, an Adviser, and a Reviser collaborate iteratively to evaluate, provide feedback, and refine the drafts. This structured, multi-stage process produces records that better satisfy clinical quality standards. Experiments show that LLM-CARe significantly enhances EMR quality across all levels compared to strong baselines and yields improved performance on real-world clinical tasks such as diagnosis prediction. Unlike prior work, our method requires no real EMR text for training or prompting, demonstrating the effectiveness of stage-wise, cyclic refinement for generating high-quality, privacy-preserving EMR datasets.
HistoPrism: Unlocking Functional Pathway Analysis from Pan-Cancer Histology via Gene Expression Prediction
Susu Hu ⋅ Qinghe Zeng ⋅ Nithya Bhasker ⋅ Jakob Nikolas Kather ⋅ Stefanie Speidel
Predicting spatial gene expression from H&E histology offers a scalable and clinically accessible alternative to sequencing, but realizing clinical impact requires models that generalize across cancer types and capture biologically coherent signals. Prior work is often limited to per-cancer settings and variance-based evaluation, leaving functional relevance underexplored. We introduce HistoPrism, an efficient transformer-based architecture for pan-cancer prediction of gene expression from histology. To evaluate biological meaning, we introduce a pathway-level benchmark, shifting assessment from isolated gene-level variance to coherent functional pathways. HistoPrism not only surpasses prior state-of-the-art models on highly variable genes but, more importantly, achieves substantial gains on pathway-level prediction, demonstrating its ability to recover biologically coherent transcriptomic patterns. With strong pan-cancer generalization and improved efficiency, HistoPrism establishes a new standard for clinically relevant transcriptomic modeling from routinely available histology.
Identity-Free Deferral For Unseen Experts
Joshua Strong ⋅ Pramit Saha ⋅ Yasin Ibrahim ⋅ Cheng Ouyang ⋅ J. Alison Noble
Learning to Defer (L2D) improves AI reliability in decision-critical environments by training AI to either make its own prediction or defer the decision to a human expert. A key challenge is adapting to unseen experts at test time, whose competence can differ from the training population. Current methods for this task, however, can falter when unseen experts are out-of-distribution (OOD) relative to the training population. We identify a core architectural flaw as the cause: they learn identity-conditioned policies by processing class-indexed signals in fixed coordinates, creating shortcuts that violate the problem's inherent permutation symmetry. We introduce Identity-Free Deferral (IFD), an architecture that enforces this symmetry by construction. From a few-shot context, IFD builds a query-independent Bayesian competence profile for each expert. It then supplies the deferral rejector with a low-dimensional, role-indexed state containing only structural information, such as the model's confidence in its top-ranked class and the expert's estimated skill for that same role, which obscures absolute class identities. We train IFD using an uncertainty-aware, context-only objective that removes the need for expensive query-time expert labels. We formally prove the permutation invariance of our approach, contrasting it with the generic non-invariance of standard population encoders. Experiments on medical imaging benchmarks and ImageNet-16H with real human annotators show that IFD consistently improves generalisation to unseen experts, with gains in OOD settings, all while using fewer annotations than alternative methods.
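To make the idea of a role-indexed state concrete, the sketch below builds a per-class competence estimate for one expert from a few-shot context and then reads off the expert's estimated skill for whichever class the model ranks first. This is a minimal sketch under assumptions, not the IFD architecture: the Beta-Bernoulli posterior mean, the uniform prior, and the names `competence_profile` and `role_indexed_state` are hypothetical choices made for illustration.

```python
def competence_profile(context, num_classes, prior_a=1.0, prior_b=1.0):
    """Posterior-mean accuracy per class under an assumed Beta(prior_a, prior_b) prior.

    context: list of (true_label, expert_label) pairs from the few-shot set.
    """
    correct = [0] * num_classes
    total = [0] * num_classes
    for true_label, expert_label in context:
        total[true_label] += 1
        if expert_label == true_label:
            correct[true_label] += 1
    return [(prior_a + c) / (prior_a + prior_b + t)
            for c, t in zip(correct, total)]

def role_indexed_state(model_probs, profile):
    """Return (model confidence in its top class, expert skill on that class).

    The state refers to the 'top-ranked' role, not to an absolute class id.
    """
    top = max(range(len(model_probs)), key=model_probs.__getitem__)
    return (model_probs[top], profile[top])

# Hypothetical 3-class example: four context labels from one expert.
CONTEXT = [(0, 0), (0, 0), (0, 1), (1, 1)]
MODEL_PROBS = [0.2, 0.7, 0.1]

profile = competence_profile(CONTEXT, num_classes=3)
print(role_indexed_state(MODEL_PROBS, profile))
```

Relabelling the classes consistently in both the context and the model output leaves this state unchanged, which is the kind of permutation symmetry the abstract refers to.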
Temporally Detailed Hypergraph Neural ODE for Disease Progression Modeling
Tingsong Xiao ⋅ Yao Lee ⋅ Zelin Xu ⋅ Yupu Zhang ⋅ Zibo Liu ⋅ Yu Huang ⋅ Jiang Bian ⋅ Jingchuan Guo ⋅ Zhe Jiang
Disease progression modeling aims to characterize and predict how a patient's disease complications worsen over time based on longitudinal electronic health records (EHRs). For diseases such as type 2 diabetes, accurate progression modeling can enhance patient sub-phenotyping and inform effective and timely interventions. However, the problem is challenging due to the need to learn continuous-time progression dynamics from irregularly sampled clinical events amid patient heterogeneity (e.g., different progression rates and pathways). Existing mechanistic and data-driven methods either lack adaptability to learn from real-world data or fail to capture complex continuous-time dynamics on progression trajectories. To address these limitations, we propose Temporally Detailed Hypergraph Neural Ordinary Differential Equation (TD-HNODE), which represents disease progression on clinically recognized trajectories as a temporally detailed hypergraph and learns the continuous-time progression dynamics via a neural ODE framework. TD-HNODE contains a learnable TD-Hypergraph Laplacian that captures the interdependency of disease complication markers within both intra- and inter-progression trajectories. Experiments on two real-world clinical datasets demonstrate that TD-HNODE outperforms multiple baselines in modeling the progression of type 2 diabetes and related cardiovascular diseases.
RealPDEBench: A Benchmark for Complex Physical Systems with Real-World Data
Peiyan Hu ⋅ Haodong Feng ⋅ Hongyuan Liu ⋅ Tongtong Yan ⋅ Wenhao Deng ⋅ Tianrun Gao ⋅ Rong Zheng ⋅ Haoren Zheng ⋅ Chenglei Yu ⋅ Chuanrui Wang ⋅ Kaiwen Li ⋅ Zhi-Ming Ma ⋅ Dezhi Zhou ⋅ Xingcai Lu ⋅ Dixia Fan ⋅ Tailin Wu
Predicting the evolution of complex physical systems remains a central problem in science and engineering. Despite rapid progress in scientific Machine Learning (ML) models, a critical bottleneck is the lack of expensive real-world data, resulting in most current models being trained and validated on simulated data. Beyond limiting the development and evaluation of scientific ML, this gap also hinders research into essential tasks such as sim-to-real transfer. We introduce RealPDEBench, the first benchmark for scientific ML that integrates real-world measurements with paired numerical simulations. RealPDEBench consists of five datasets, three tasks, nine metrics, and ten baselines. We first present five real-world measured datasets with paired simulated datasets across different complex physical systems. We further define three tasks, which allow comparisons between real-world and simulated data, and facilitate the development of methods to bridge the two. Moreover, we design nine evaluation metrics, spanning data-oriented and physics-oriented metrics, and finally benchmark ten representative baselines, including state-of-the-art models, pretrained PDE foundation models, and a traditional method. Experiments reveal significant discrepancies between simulated and real-world data, while showing that pretraining with simulated data consistently improves both accuracy and convergence. In this work, we hope to provide insights from real-world data, advancing scientific ML toward bridging the sim-to-real gap and real-world deployment. Our benchmark, datasets, and instructions are available at https://realpdebench.github.io/.
GenCP: Towards Generative Modeling Paradigm of Coupled physics
Tianrun Gao ⋅ Haoren Zheng ⋅ Wenhao Deng ⋅ Haodong Feng ⋅ Tao Zhang ⋅ Ruiqi Feng ⋅ Qianyi Chen ⋅ Tailin Wu
Real-world physical systems are inherently complex, often involving the coupling of multiple physics, making their simulation both highly valuable and challenging. Many mainstream approaches face challenges when dealing with decoupled data. They also suffer from low efficiency and fidelity in strongly coupled spatio-temporal physical systems. Here we propose GenCP, a novel and elegant generative paradigm for coupled multiphysics simulation. By formulating coupled-physics modeling as a probabilistic modeling problem, our key innovation is to integrate probability density evolution in generative modeling with iterative multiphysics coupling, thereby enabling training on data from decoupled simulation and inferring coupled physics during sampling. We also utilize operator-splitting theory in the space of probability evolution to establish error controllability guarantees for this “conditional-to-joint” sampling scheme. We evaluate our paradigm on a synthetic setting and three challenging multiphysics scenarios to demonstrate both principled insight and superior application performance of GenCP. Code is available at this repo: https://github.com/AI4Science-WestlakeU/GenCP.
DRIFT-Net: A Spectral-Coupled Neural Operator for PDEs Learning
jiayi li ⋅ Flora Salim
Learning PDE dynamics with neural solvers can significantly improve wall-clock efficiency and accuracy compared with classical numerical solvers. In recent years, foundation models for PDEs have largely adopted multi-scale windowed self-attention, with the scOT backbone in Poseidon serving as a representative example. However, because of their locality, truly globally consistent spectral coupling can only be propagated gradually through deep stacking and window shifting. This weakens global coupling and leads to error accumulation and drift during closed-loop rollouts. To address this, we propose DRIFT-Net. It employs a dual-branch design comprising a spectral branch and an image branch. The spectral branch is responsible for capturing global, large-scale low-frequency information, whereas the image branch focuses on local details and nonstationary structures. Specifically, we first perform controlled, lightweight mixing within the low-frequency range. Then we fuse the spectral and image paths at each layer via bandwise weighting, which avoids the width inflation and training instability caused by naive concatenation. The fused result is transformed back into the spatial domain and added to the image branch, thereby preserving both global structure and high-frequency details across scales. Compared with strong attention-based baselines, DRIFT-Net achieves lower error and higher throughput with fewer parameters under identical training settings and budget. On Navier--Stokes benchmarks, the relative $L_{1}$ error is reduced by 7%-54%, the parameter count decreases by about 15%, and the throughput remains higher than scOT. Ablation studies and theoretical analyses further demonstrate the stability and effectiveness of this design. The code is available at https://github.com/cruiseresearchgroup/DRIFT-Net.
EGG-SR: Embedding Symbolic Equivalence into Symbolic Regression via Equality Graph
Nan Jiang ⋅ Ziyi Wang ⋅ Yexiang Xue
Symbolic regression seeks to uncover physical laws from experimental data by searching for closed-form expressions, which is an important task in AI-driven scientific discovery. Yet the exponential growth of the search space of expressions renders the task computationally challenging. A promising yet underexplored direction for reducing the search space and accelerating training lies in *symbolic equivalence*: many expressions, although syntactically different, define the same function -- for example, $\log(x_1^2x_2^3)$, $\log(x_1^2)+\log(x_2^3)$, and $2\log(x_1)+3\log(x_2)$. Existing algorithms treat such variants as distinct outputs, leading to redundant exploration and slow learning. We introduce EGG-SR, a unified framework that integrates symbolic equivalence into a class of modern symbolic regression methods, including Monte Carlo Tree Search (MCTS), Deep Reinforcement Learning (DRL), and Large Language Models (LLMs). EGG-SR compactly represents equivalent expressions through the proposed EGG module (via equality graphs), accelerating learning by: (1) pruning redundant subtree exploration in EGG-MCTS, (2) aggregating rewards across equivalent generated sequences in EGG-DRL, and (3) enriching feedback prompts in EGG-LLM. Theoretically, we show the benefit of embedding EGG into learning: it tightens the regret bound of MCTS and reduces the variance of the DRL gradient estimator. Empirically, EGG-SR consistently enhances a class of symbolic regression models across several benchmarks, discovering more accurate expressions within the same time limit. Project page is at: https://nan-jiang-group.github.io/egg-sr.
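The symbolic equivalence example in the abstract can be checked numerically. The sketch below evaluates the three logarithm expressions at a few positive sample points and confirms that they agree. It only illustrates the equivalence itself and does not build an equality graph; the function names and sample points are made up for this illustration.

```python
import math

# Three syntactically different forms of the same function (for x1, x2 > 0).
def expr_a(x1, x2):
    return math.log(x1**2 * x2**3)

def expr_b(x1, x2):
    return math.log(x1**2) + math.log(x2**3)

def expr_c(x1, x2):
    return 2 * math.log(x1) + 3 * math.log(x2)

def agree_on(points, tol=1e-9):
    """Return True if all three forms match on every sample point."""
    for x1, x2 in points:
        a, b, c = expr_a(x1, x2), expr_b(x1, x2), expr_c(x1, x2)
        if not (math.isclose(a, b, abs_tol=tol)
                and math.isclose(a, c, abs_tol=tol)):
            return False
    return True

# Arbitrary positive sample points (the logarithms require x1, x2 > 0).
SAMPLE_POINTS = [(0.5, 2.0), (1.0, 1.0), (3.0, 0.25), (10.0, 7.0)]
print(agree_on(SAMPLE_POINTS))
```

A search procedure that scores these three candidates separately spends three evaluations on one function, which is the redundancy the abstract describes.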
Unveiling the Mechanism of Continuous Representation Full-Waveform Inversion: A Wave Based Neural Tangent Kernel Framework
Ruihua Chen ⋅ Yisi Luo ⋅ Bangyu Wu ⋅ Deyu Meng
Full-waveform inversion (FWI) estimates physical parameters in the wave equation from limited measurements and has been widely applied in geophysical exploration, medical imaging, and non-destructive testing. Conventional FWI methods are limited by their notorious sensitivity to the accuracy of the initial models. Recent progress in continuous representation FWI (CR-FWI) demonstrates that representing parameter models with a coordinate-based neural network, such as implicit neural representation (INR), can mitigate the dependence on initial models. However, its underlying mechanism remains unclear, and INR-based FWI shows slower high-frequency convergence. In this work, we investigate the general CR-FWI framework and develop a unified theoretical understanding by extending the neural tangent kernel (NTK) for FWI to establish a wave-based NTK framework. Unlike standard NTK, our analysis reveals that wave-based NTK is not constant, both at initialization and during training, due to the inherent nonlinearity of FWI. We further show that the eigenvalue decay behavior of the wave-based NTK can explain why CR-FWI alleviates the dependency on initial models and shows slower high-frequency convergence. Building on these insights, we propose several CR-FWI methods with tailored eigenvalue decay properties for FWI, including a novel hybrid representation combining INR and multi-resolution grid (termed IG-FWI) that achieves a more balanced trade-off between robustness and high-frequency convergence rate. Applications in geophysical exploration on Marmousi, 2D SEG/EAGE Salt and Overthrust, 2004 BP model, and the more realistic 2014 Chevron models show the superior performance of our proposed methods compared to conventional FWI and existing INR-based FWI methods.
OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
Kevin Valencia ⋅ Thilina Balasooriya ⋅ Xihaier Luo ⋅ Shinjae Yoo ⋅ David Park
Multimodal spatiotemporal learning on real-world experimental data is constrained by two challenges: within-modality measurements are sparse, irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of available modalities varies across space and time, shrinking the usable record unless models can adapt to arbitrary subsets at train and test time. We propose OmniField, a continuity-aware framework that learns a continuous neural field conditioned on available modalities and iteratively fuses cross-modal context. A multimodal crosstalk block architecture paired with iterative cross-modal refinement aligns signals prior to the decoder, enabling unified reconstruction, interpolation, forecasting, and cross-modal prediction without gridding or surrogate preprocessing. Extensive evaluations show that OmniField consistently outperforms eight strong multimodal spatiotemporal baselines. Under heavy simulated sensor noise, performance remains close to clean-input levels, highlighting robustness to corrupted measurements.
Test-Time Accuracy-Cost Control in Neural Simulators via Recurrent-Depth
Harris Abdul Majid ⋅ Pietro Sittoni ⋅ Francesco Tudisco
Accuracy-cost trade-offs are a fundamental aspect of scientific computing. Classical numerical methods inherently offer such a trade-off: increasing resolution, order, or precision typically yields more accurate solutions at higher computational cost. We introduce Recurrent-Depth Simulator (RecurrSim), an architecture-agnostic framework that enables explicit test-time control over accuracy-cost trade-offs in neural simulators without requiring retraining or architectural redesign. By setting the number of recurrent iterations $K$, users can generate fast, less-accurate simulations for exploratory runs or real-time control loops, or increase $K$ for more-accurate simulations in critical applications or offline studies. We demonstrate RecurrSim's effectiveness across fluid dynamics benchmarks (Burgers, Korteweg-De Vries, Kuramoto-Sivashinsky), achieving physically faithful simulations over long horizons even in low-compute settings. On high-dimensional 3D compressible Navier-Stokes simulations with 262k points, a 0.8B parameter RecurrFNO outperforms 1.6B parameter baselines while using 13.5% less training memory. RecurrSim consistently delivers superior accuracy-cost trade-offs compared to alternative adaptive-compute models, including Deep Equilibrium and diffusion-based approaches. We further validate broad architectural compatibility: RecurrViT reduces error accumulation by 90% compared to standard Vision Transformers on Active Matter, while RecurrUPT matches UPT performance on ShapeNet-Car using 44% fewer parameters.
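The role of the iteration count $K$ as a test-time knob can be illustrated with a toy stand-in. In the sketch below the shared block is a classical Jacobi sweep on a small linear system, not a neural network, and the same block is applied `num_iters` times. This is only an analogy under stated assumptions: the names `refine_step` and `simulate` and the 2x2 system are hypothetical and are not taken from RecurrSim.

```python
def refine_step(state, rhs):
    """One shared block: a Jacobi sweep for the system
    4*x0 + x1 = rhs[0],  x0 + 3*x1 = rhs[1]."""
    x0, x1 = state
    return ((rhs[0] - x1) / 4.0, (rhs[1] - x0) / 3.0)

def simulate(rhs, num_iters):
    """Apply the same block num_iters times (the test-time knob K)."""
    state = (0.0, 0.0)
    for _ in range(num_iters):
        state = refine_step(state, rhs)
    return state

def error(state, exact):
    """Maximum absolute deviation from the exact solution."""
    return max(abs(s - e) for s, e in zip(state, exact))

RHS = (6.0, 7.0)
EXACT = (1.0, 2.0)  # check: 4*1 + 2 = 6 and 1 + 3*2 = 7

# Small K is cheap and coarse; larger K costs more and is more accurate.
for k in (1, 4, 16):
    print(k, error(simulate(RHS, k), EXACT))
```

No parameters change between the runs; only the number of times the block is applied differs, which mirrors the trade-off described in the abstract.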
CFO: Learning Continuous-Time PDE Dynamics via Flow-Matched Neural Operators
Xianglong Hou ⋅ Xinquan Huang ⋅ Paris Perdikaris
Neural operator surrogates for time-dependent partial differential equations (PDEs) conventionally employ autoregressive (AR) prediction schemes, which accumulate error over long rollouts and require uniform temporal discretization. We introduce the Continuous Flow Operator (CFO), a framework that learns continuous-time PDE dynamics without the computational burden of standard continuous approaches, e.g., neural ODE. The key insight is repurposing flow matching to directly learn the right-hand side of PDEs without backpropagating through ODE solvers. CFO fits temporal splines to trajectory data, using finite-difference estimates of time derivatives at knots to construct probability paths whose velocities closely approximate the true PDE dynamics. A neural operator is then trained via flow matching to predict these analytic velocity fields. This approach is inherently time-resolution invariant: training accepts trajectories sampled on trajectory-specific, non-uniform time grids while inference queries solutions at any temporal resolution through ODE integration. Across four benchmarks (Lorenz, 1D Burgers, 2D diffusion-reaction, 2D shallow water), CFO demonstrates superior long-horizon stability and remarkable data efficiency. CFO trained on only 25% of irregularly subsampled time points outperforms autoregressive baselines trained on complete data, with relative error reductions up to 87%. Despite requiring numerical integration at inference, CFO achieves competitive efficiency, exceeding AR accuracy using only half the AR rollout step budget, while uniquely enabling reverse-time inference and arbitrary temporal querying. Code is available at https://github.com/shannon-hou/CFO_official
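The step of estimating time derivatives at knots on a non-uniform time grid can be sketched as follows. The abstract does not say which finite-difference stencil is used, so this sketch assumes a three-point formula at interior knots (exact for quadratics) and one-sided differences at the two endpoints; the function name `knot_derivatives` is hypothetical.

```python
def knot_derivatives(times, values):
    """Finite-difference estimates of du/dt at each knot of a non-uniform grid.

    Interior knots use the three-point non-uniform stencil; the two
    endpoints fall back to first-order one-sided differences.
    """
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two knots and matching lengths")
    derivs = [0.0] * n
    derivs[0] = (values[1] - values[0]) / (times[1] - times[0])
    derivs[-1] = (values[-1] - values[-2]) / (times[-1] - times[-2])
    for i in range(1, n - 1):
        h1 = times[i] - times[i - 1]      # spacing to the left
        h2 = times[i + 1] - times[i]      # spacing to the right
        derivs[i] = (
            -h2 / (h1 * (h1 + h2)) * values[i - 1]
            + (h2 - h1) / (h1 * h2) * values[i]
            + h1 / (h2 * (h1 + h2)) * values[i + 1]
        )
    return derivs

# Irregularly spaced samples of u(t) = t**2, whose derivative is 2*t.
TIMES = [0.0, 0.1, 0.4, 0.5, 1.0]
VALUES = [t * t for t in TIMES]
print(knot_derivatives(TIMES, VALUES))
```

Because the stencil weights depend on the local spacings `h1` and `h2`, each trajectory can carry its own time grid, consistent with the non-uniform sampling the abstract mentions.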
ComPhy: Composing Physical Models with end-to-end Alignment
Alessandro Trenta ⋅ Andrea Cossu ⋅ Davide Bacciu
Real-world phenomena typically involve multiple, interwoven dynamics that can be elegantly captured by systems of Partial Differential Equations (PDEs). However, accurately solving such systems remains a challenge. In this paper, we introduce ComPhy (CP), a novel modular framework designed to leverage the inherent physical structure of the problem to solve systems of PDEs. CP assigns each PDE to a dedicated learning module, each capable of incorporating state-of-the-art methodologies such as Physics-Informed Neural Networks or Neural Conservation Laws. Crucially, CP introduces an end-to-end alignment mechanism, explicitly designed around the physical interplay of shared variables, enabling knowledge transfer between modules, and promoting solutions that are the result of the collective effort of all modules. CP is the first approach specifically designed to tackle systems of PDEs, and our results show that it outperforms state-of-the-art approaches where a single model is trained on all PDEs at once.
Fast training of accurate physics-informed neural networks without gradient descent
Chinmay Datar ⋅ Taniya Kapoor ⋅ Abhishek Chandra ⋅ Qing Sun ⋅ Erik L. Bolager ⋅ Iryna Burak ⋅ Anna Veselovska ⋅ Massimo Fornasier ⋅ Felix Dietrich
Solving time-dependent Partial Differential Equations (PDEs) is one of the most critical problems in computational science. While Physics-Informed Neural Networks (PINNs) offer a promising framework for approximating PDE solutions, their accuracy and training speed are limited by two core barriers: gradient-descent-based iterative optimization over complex loss landscapes and non-causal treatment of time as an extra spatial dimension. We present Frozen-PINN, a novel PINN based on the principle of space-time separation that leverages random features instead of training with gradient descent, and incorporates temporal causality by construction. On nine PDE benchmarks, including challenges like extreme advection speeds, shocks, and high-dimensionality, Frozen-PINNs achieve superior training efficiency and accuracy over state-of-the-art PINNs, often by several orders of magnitude. Our work addresses longstanding training and accuracy bottlenecks of PINNs, delivering quickly trainable, highly accurate, and inherently causal PDE solvers, a combination that prior methods could not realize. Our approach challenges the reliance of PINNs on stochastic gradient-descent-based methods and specialized hardware, leading to a paradigm shift in PINN training and providing a challenging benchmark for the community.
KANO: Kolmogorov-Arnold Neural Operator
Jin Lee ⋅ Ziming Liu ⋅ Xinling Yu ⋅ Yixuan Wang ⋅ Haewon Jeong ⋅ Murphy Niu ⋅ Zheng Zhang
We introduce the Kolmogorov–Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by spectral and spatial bases, with intrinsic symbolic interpretability. We show theoretically that KANO overcomes the pure-spectral bottleneck of the Fourier Neural Operator (FNO): KANO remains expressive over generic position-dependent dynamics for any physical input, whereas FNO is practical only for spectrally sparse operators and strictly requires a fast-decaying Fourier tail in the input. We verify these claims empirically on position-dependent differential operators, on which KANO generalizes robustly but FNO does not. On the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations with coefficients accurate to the fourth decimal place and attains $\approx6\times10^{-6}$ state infidelity from projective measurement data, orders of magnitude lower than the $\approx1.5\times10^{-2}$ of an FNO trained with ideal full wave function data.
Buckingham $\pi$-Invariant Test‑Time Projection for Robust PDE Surrogate Modeling
Seokki Lee ⋅ Min-Chul Park ⋅ Giyong Hong ⋅ Changwook Jeong
PDE surrogate models such as FNO and PINN struggle to predict solutions across inputs with diverse physical units and scales, limiting their out-of-distribution (OOD) generalization. We propose a $\pi$-invariant test-time projection that aligns test inputs with the training distribution by solving a log-space least squares problem that preserves Buckingham $\pi$-invariants. For PDEs with multidimensional spatial fields, we use geometric representative $\pi$-values to compute distances and project inputs, overcoming degeneracy and singular points that limit prior $\pi$-methods. To accelerate projection, we cluster the training set into K clusters, reducing the complexity from O(MN) to O(KN) for M training and N test samples. Across wide input scale ranges, tests on 2D thermal conduction and linear elasticity achieve an average MAE reduction of up to $\approx 91\%$ with minimal overhead. This training-free, model-agnostic method is expected to apply to more diverse PDE-based simulations.
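One way to read the log-space projection is sketched below. This is an interpretation, not the authors' implementation. If D is the dimension matrix of the input quantities, a change of units shifts log x by D^T s, and every dimensionless product with exponent vector a satisfying D a = 0 is unchanged by that shift. Finding the rescaling that lands closest to a training sample is then an ordinary least-squares problem in s. The three quantities, the Reynolds-number check, and all function names are assumptions chosen for illustration; the clustering step is not shown.

```python
import numpy as np

# Dimension matrix: rows are base units (m, s), columns are the quantities
# (length L, velocity v, kinematic viscosity nu). Illustrative choice.
D = np.array([[1.0, 1.0, 2.0],
              [0.0, -1.0, -1.0]])

def project_pi_invariant(x, x_ref, D):
    """Rescale the units of x so it lies as close as possible to x_ref in log space.

    log(x) is moved only along the row space of D, so every dimensionless
    product (exponent vector a with D @ a = 0) keeps its value.
    """
    log_x, log_ref = np.log(x), np.log(x_ref)
    # Solve min_s || log_x + D.T @ s - log_ref ||^2.
    s, *_ = np.linalg.lstsq(D.T, log_ref - log_x, rcond=None)
    return np.exp(log_x + D.T @ s)

def reynolds(x):
    """Dimensionless group L * v / nu, used here to check invariance."""
    L, v, nu = x
    return L * v / nu

x_test = np.array([1000.0, 0.002, 1e-3])  # test input far from the training scale
x_train = np.array([1.0, 1.0, 1e-3])      # hypothetical training representative
x_proj = project_pi_invariant(x_test, x_train, D)
```

The projected input keeps the Reynolds number of `x_test` while sitting no farther from `x_train` in log space than the original input did.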
Map as a Prompt: Learning Multi-Modal Spatial-Signal Foundation Models for Cross-scenario Wireless Localization
Yong Chu ⋅ Xun Zhou ⋅ Zenglin Xu ⋅ HUI WANG ⋅ Yue Yu
Accurate and robust wireless localization is a critical enabler for emerging 5G/6G applications, including autonomous driving, extended reality, and smart manufacturing. Despite its importance, achieving precise localization across diverse environments remains challenging due to the complex nature of wireless signals and their sensitivity to environmental changes. Existing data-driven approaches often suffer from limited generalization capability, requiring extensive labeled data and struggling to adapt to new scenarios. To address these limitations, we propose SigMap, a multimodal foundation model that introduces two key innovations: (1) A cycle-adaptive masking strategy that dynamically adjusts masking patterns based on channel periodicity characteristics to learn robust wireless representations; (2) A novel "map-as-prompt" framework that integrates 3D geographic information through lightweight soft prompts for effective cross-scenario adaptation. Extensive experiments demonstrate that our model achieves state-of-the-art performance across multiple localization tasks while exhibiting strong zero-shot generalization in unseen environments, significantly outperforming both supervised and self-supervised baselines by considerable margins.
Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics
Hanming Yang ⋅ Antonio Khalil Moretti ⋅ Sebastian Macaluso ⋅ Philippe Chlenski ⋅ Christian A. Naesseth ⋅ Itsik Pe'er
Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a main problem in data analyses of collider physics. This intricate task deals with estimating the latent structure of a jet (binary tree) and involves parameters such as particle energy, momentum, and types. While Bayesian methods offer a natural approach for handling uncertainty and leveraging prior knowledge, they face significant challenges due to the super-exponential growth of potential jet topologies as the number of observed particles increases. To address this, we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet latent structures. As a second contribution, we leverage the resulting estimator to develop a variational inference algorithm for parameter learning. Building on this, we introduce a variational family using a pseudo-marginal framework for a fully Bayesian treatment of all variables, unifying the generative model with the inference process. We illustrate our method's effectiveness through experiments using data generated with a collider physics generative model, highlighting superior speed and accuracy across a range of tasks.
Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
Kai Zeng ⋅ Zhanqian Wu ⋅ Kaixin Xiong ⋅ Xiaobao Wei ⋅ Xiangyu Guo ⋅ Zhenxin Zhu ⋅ Kalok Ho ⋅ Lijun Zhou ⋅ Bohan Zeng ⋅ Ming Lu ⋅ Haiyang Sun ⋅ Bing Wang ⋅ Guang Chen ⋅ Hangjun Ye ⋅ Wentao Zhang
Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Project website: https://wm-research.github.io/Dream4Drive/.
Grounding Generative Planners in Verifiable Logic: A Hybrid Architecture for Trustworthy Embodied AI
Feiyu Wu ⋅ Xu Zheng ⋅ Yue Qu ⋅ Zhuocheng Wang ⋅ Zicheng Feng ⋅ HUI LI
While Large Language Models (LLMs) show immense promise as planners for embodied AI, their stochastic nature and lack of formal reasoning capabilities prevent the strict safety guarantees required for physical deployment. Current approaches fall short: they either rely on other unreliable LLMs for safety checks or simply reject unsafe plans without offering a path to success. This work bridges this critical gap by introducing the Verifiable Iterative Refinement Framework (VIRF), a neuro-symbolic architecture that shifts the paradigm from a passive safety gatekeeper to an active safety collaborator. Where prior verifiers simply reject failures, our framework provides causal, pedagogical feedback that teaches the LLM why its plan was unsafe, enabling intelligent repairs rather than mere avoidance. Our core contribution is a novel tutor-apprentice dialogue, where a deterministic Logic Tutor, grounded in a formal safety ontology, provides causal and explanatory feedback to an LLM Apprentice planner. This pedagogical interaction allows the apprentice to perform intelligent, creative plan repairs, resolving safety conflicts rather than merely avoiding them. To ground this dialogue in verifiable truth, we introduce a scalable knowledge acquisition pipeline that synthesizes a comprehensive safety knowledge base from real-world documents, a process that simultaneously reveals and corrects significant blind spots in existing benchmarks. On a new suite of challenging home safety tasks, VIRF achieves a perfect 0% Hazardous Action Rate (HAR), completely eliminating unsafe actions while attaining a 77.3% Goal-Condition Rate (GCR), the highest among all baselines. It does so with remarkable efficiency, requiring only 1.1 correction iterations on average. By acting as a verifiable safety scaffold, VIRF demonstrates a principled and robust pathway toward building embodied agents that are not just capable, but fundamentally trustworthy.
Robust Fine-tuning of Vision-Language-Action Robot Policies via Parameter Merging
Yajat Yadav ⋅ Zhiyuan Zhou ⋅ Andrew Wagenmaker ⋅ Karl Pertsch ⋅ Sergey Levine
Generalist robot policies, trained on large and diverse datasets, have demonstrated the ability to generalize across a wide spectrum of behaviors, enabling a single policy to act in varied real-world environments. However, they still fall short on new tasks not covered in the training data. When finetuned on limited demonstrations of a new task, these policies often overfit to the specific demonstrations, not only losing their prior abilities to solve a wide variety of generalist tasks but also failing to generalize within the new task itself. In this work, we aim to develop a method that preserves the generalization capabilities of the generalist policy during finetuning, allowing a single policy to robustly incorporate a new skill into its repertoire. Our goal is a single policy that both learns to generalize to variations of the new task and retains the broad competencies gained from pretraining. We show that this can be achieved through a simple yet effective strategy: interpolating the weights of a finetuned model with those of the pretrained model. We show, across extensive simulated and real-world experiments, that such model merging produces a single model that inherits the generalist abilities of the base model and learns to solve the new task robustly, outperforming both the pretrained and finetuned models on out-of-distribution variations of the new task. Moreover, we show that model merging enables continual acquisition of new skills in a lifelong learning setting, without sacrificing previously learned generalist abilities.
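The weight interpolation described above can be sketched in a few lines. This is a minimal illustration that assumes both checkpoints expose parameters as name-to-array dictionaries with identical keys and shapes. The parameter names, the toy values, and the mixing coefficient `alpha` are hypothetical; the abstract does not say how the coefficient is chosen.

```python
import numpy as np

def merge_weights(pretrained, finetuned, alpha):
    """Linearly interpolate two parameter dicts with matching keys and shapes.

    alpha = 0 returns the pretrained weights, alpha = 1 the finetuned ones.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("parameter names differ between the two models")
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * finetuned[name]
        for name in pretrained
    }

# Toy parameters standing in for real policy weights (hypothetical names).
pretrained = {"layer.weight": np.zeros((2, 2)), "layer.bias": np.zeros(2)}
finetuned = {"layer.weight": np.ones((2, 2)), "layer.bias": np.full(2, 2.0)}

merged = merge_weights(pretrained, finetuned, alpha=0.5)
```

The merged model has the same architecture and parameter count as either endpoint, so it adds no inference cost.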
Real-Time Robot Execution with Masked Action Chunking
Haoxuan Wang ⋅ Gengyu Zhang ⋅ Yan Yan ⋅ Yuzhang Shang ⋅ Ramana Kompella ⋅ Gaowen Liu
Real-time execution is essential for cyber-physical systems such as robots. These systems operate in dynamic real-world environments where even small delays can undermine responsiveness and compromise performance. Asynchronous inference has recently emerged as a system-level paradigm for real-time robot manipulation, enabling the next action chunk to be predicted while the current one is being executed. While this approach achieves real-time responsiveness, naive integration often results in execution failure. Previous methods attributed this failure to inter-chunk discontinuity and developed test-time algorithms to smooth chunk boundaries. In contrast, we identify another critical yet overlooked factor: intra-chunk inconsistency, where the robot’s executed action chunk partially misaligns with its current perception. To address this, we propose REMAC, which learns corrective adjustments on the pretrained policy through masked action chunking, enabling the policy to remain resilient under mismatches between intended actions and actual execution during asynchronous inference. In addition, we introduce a prefix-preserved sampling procedure to reinforce inter-chunk continuity. Overall, our method delivers more reliable policies without incurring additional latency. Extensive experiments in both simulation and real-world settings demonstrate that our method enables faster task execution, maintains robustness across varying delays, and consistently achieves higher completion rates.
RF-MatID: Dataset and Benchmark for Radio Frequency Material Identification
Xinyan Chen ⋅ Qinchun Li ⋅ Ruiqin Ma ⋅ Jiaqi Bai ⋅ Li Yi ⋅ Jianfei Yang
Accurate material identification plays a crucial role in embodied AI systems, enabling a wide range of applications. However, current vision-based solutions are limited by the inherent constraints of optical sensors, while radio-frequency (RF) approaches, which can reveal intrinsic material properties, have received growing attention. Despite this progress, RF-based material identification remains hindered by the lack of large-scale public datasets and the limited benchmarking of learning-based approaches. In this work, we present RF-MatID, the first open-source, large-scale, wide-band, and geometry-diverse RF dataset for fine-grained material identification. RF-MatID includes 16 fine-grained categories grouped into 5 superclasses, spanning a broad frequency range from 4 to 43.5 GHz, and comprises 142k samples in both frequency- and time-domain representations. The dataset systematically incorporates controlled geometry perturbations, including variations in incidence angle and stand-off distance. We further establish a multi-setting, multi-protocol benchmark by evaluating state-of-the-art deep learning models, assessing both in-distribution performance and out-of-distribution robustness under cross-angle and cross-distance shifts. The 5 frequency-allocation protocols enable systematic frequency- and region-level analysis, thereby facilitating real-world deployment. RF-MatID aims to enable reproducible research, accelerate algorithmic advancement, foster cross-domain robustness, and support the development of real-world applications in RF-based material identification.
All-day Multi-scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation
XUDONG WANG ⋅ Gan Li ⋅ Zhiyu Liu ⋅ Yao Wang ⋅ Lianqing Liu ⋅ Zhi Han
Deploying vision-and-language navigation (VLN) agents requires adaptation across diverse scenes and environments, but fine-tuning on a specific scenario often causes catastrophic forgetting in others, which severely limits flexible long-term deployment. We formalize this challenge as the all-day multi-scenes lifelong VLN (AML-VLN) problem. Existing parameter-efficient adapters (e.g., LoRA and its variants) are limited by their two-dimensional matrix form, which fails to capture the multi-hierarchical navigation knowledge spanning multiple scenes and environments. To address this, we propose Tucker Adaptation (TuKA), which represents the multi-hierarchical navigation knowledge as a high-order tensor and leverages Tucker decomposition to decouple the knowledge into shared subspaces and scenario-specific experts. We further introduce a decoupled knowledge incremental learning strategy to consolidate shared subspaces while constraining specific experts for decoupled lifelong learning. Building on TuKA, we also develop a VLN agent named AlldayWalker, which continually learns across multiple navigation scenarios, achieving all-day multi-scenes navigation. Extensive experiments show that AlldayWalker consistently outperforms state-of-the-art baselines.
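For readers unfamiliar with Tucker decomposition, the sketch below shows how a third-order stack of per-scenario weight updates can be built from a small core tensor and three factor matrices. It illustrates the general decomposition only, not TuKA itself. The tensor layout (output dim × input dim × scenario), the ranks, and the reading of the scenario factor rows as "scenario-specific" parts are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_scen = 6, 5, 3     # hypothetical layer size and scenario count
r_out, r_in, r_scen = 2, 2, 2     # Tucker ranks, also hypothetical

core = rng.normal(size=(r_out, r_in, r_scen))   # core tensor
U_out = rng.normal(size=(d_out, r_out))         # factor shared by all scenarios
U_in = rng.normal(size=(d_in, r_in))            # factor shared by all scenarios
U_scen = rng.normal(size=(n_scen, r_scen))      # one row per scenario

def tucker_reconstruct(core, U_out, U_in, U_scen):
    """Full tensor G x1 U_out x2 U_in x3 U_scen, shape (d_out, d_in, n_scen)."""
    return np.einsum("abc,ia,jb,tc->ijt", core, U_out, U_in, U_scen)

def scenario_update(core, U_out, U_in, u_t):
    """Weight update for one scenario without forming the full tensor."""
    mixed_core = np.einsum("abc,c->ab", core, u_t)   # shape (r_out, r_in)
    return U_out @ mixed_core @ U_in.T               # shape (d_out, d_in)

delta_W = tucker_reconstruct(core, U_out, U_in, U_scen)
delta_W_1 = scenario_update(core, U_out, U_in, U_scen[1])
```

A two-dimensional low-rank adapter corresponds to a single slice of this tensor. The third mode is what lets several scenarios share `U_out` and `U_in` while differing only in their row of `U_scen`.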
DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving
Yingyan Li ⋅ Shuyao Shang ⋅ Weisong Liu ⋅ Bing Zhan ⋅ Haochen Wang ⋅ Yuqi Wang ⋅ Yuntao Chen ⋅ Xiaoman Wang ⋅ Yasong An ⋅ Chufeng Tang ⋅ LU HOU ⋅ Lue Fan ⋅ Zhaoxiang Zhang
Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a "supervision deficit": the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose DriveVLA-W0, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm's versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.
M$^3$E: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts
Yongliang Jiang ⋅ Huaidong Zhang ⋅ Xuandi Luo ⋅ Shengfeng He
Vision-and-Language Navigation (VLN) agents have shown strong capabilities in following natural language instructions. However, they often struggle to generalize across environments due to catastrophic forgetting, which limits their practical use in real-world settings where agents must continually adapt to new domains. We argue that overcoming forgetting across environments hinges on decoupling global scene reasoning from local perceptual alignment, allowing the agent to adapt to new domains while preserving specialized capabilities. To this end, we propose M$^3$E, the Mixture of Macro and Micro Experts, an environment-aware hierarchical MoE framework for continual VLN. Our method introduces a dual-router architecture that separates navigation into two levels of reasoning. A macro-level, scene-aware router selects strategy experts based on global environmental features (e.g., office vs. residential), while a micro-level, instance-aware router activates perception experts based on local instruction-vision alignment for step-wise decision making. To preserve knowledge across domains, we adopt a dynamic momentum update strategy that identifies expert utility in new environments and selectively updates or freezes their parameters. We evaluate M$^3$E in a domain-incremental setting on the R2R and REVERIE datasets, where agents learn across unseen scenes without revisiting prior data. Results show that our method consistently outperforms standard fine-tuning and existing continual learning baselines in both adaptability and knowledge retention, offering a parameter-efficient solution for building generalizable embodied agents.
VLMgineer: Vision-Language Models as Robotic Toolsmiths
George Gao ⋅ Tianyu Li ⋅ Junyao Shi ⋅ Yihan Li ⋅ Zizhe Zhang ⋅ Nadia Figueroa ⋅ Dinesh Jayaraman
Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, it is often regarded as a measurable indicator of cognitive intelligence across biological species. While much of today’s research on robotics intelligence focuses on generating better control strategies, inventing smarter tools offers a complementary form of physical intelligence: moving the problem-solving onus into the tool’s geometry so that control becomes simpler. This motivates us to ask: can today’s foundation models offer useful priors to automatically invent, and effectively wield, such tools? We present VLMgineer, the first fully automatic framework that designs tools and actions from scratch by harnessing the creativity of Vision–Language Models (VLMs) together with evolutionary search. We evaluate VLMgineer on a diverse benchmark of everyday manipulation scenarios that demand creative tool design and use. Across this suite, VLMgineer consistently discovers tools and policies that solve tasks more effectively and innovatively, transforming challenging robotics problems into straightforward executions. It also consistently outperforms VLM-generated designs from human specifications and existing human-crafted tools for everyday tasks. We further demonstrate that VLMgineer’s automatically designed tools and action policies transfer seamlessly to real-world task execution on a physical robot. To facilitate future research on automated tool invention, we will release our benchmark and code. Project Website: https://vlmgineer.github.io/.
Self-Improving Vision-Language-Action Models with Data Generation via Residual RL
Wenli Xiao ⋅ Haotian Lin ⋅ Andy Peng ⋅ Haoru Xue ⋅ Tairan He ⋅ Zhengyi Luo ⋅ Yuqi Xie ⋅ Fengyuan Hu ⋅ Jim Fan ⋅ Guanya Shi ⋅ Yuke Zhu
Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a plug-and-play framework that improves VLAs through residual reinforcement learning and distribution-aware data collection. In Stage 1 (specialist acquisition), we freeze the VLA backbone and train lightweight residual actors via off-policy RL. These specialists take over in states where the base policy fails, thereby probing failure regions of the generalist. In Stage 2 (data collection), we employ a hybrid rollout scheme that biases residual interventions toward states frequently visited by the base policy, aligning collected trajectories with the generalist’s deployment distribution while capturing recovery behaviors. In Stage 3 (fine-tuning), these curated trajectories are distilled back into the generalist with standard SFT, applicable to both flow-matching and autoregressive heads. We evaluate PLD across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and demonstrates practicality on real-world Franka arm manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs’ capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models.
Differentiable Model Predictive Control on the GPU
Emre Adabag ⋅ Marcus Greiff ⋅ John Subosits ⋅ Thomas Lew
Differentiable model predictive control (MPC) offers a powerful framework for combining learning and control. However, its adoption has been limited by the inherently sequential nature of traditional optimization algorithms, which are challenging to parallelize on modern computing hardware like GPUs. In this work, we tackle this bottleneck by introducing a GPU-accelerated differentiable optimization tool for MPC. This solver leverages sequential quadratic programming and a custom preconditioned conjugate gradient (PCG) routine with tridiagonal preconditioning to exploit the problem's structure and enable efficient parallelization. We demonstrate substantial speedups over CPU- and GPU-based baselines, significantly improving upon state-of-the-art training times on benchmark reinforcement learning and imitation learning tasks. Finally, we showcase the method on the challenging task of reinforcement learning for driving at the limits of handling, where it enables robust drifting of a Toyota Supra through water puddles.
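As a rough illustration of conjugate gradients with a tridiagonal preconditioner, the sketch below solves a small symmetric positive definite system, using the tridiagonal part of the matrix as the preconditioner and the Thomas algorithm to apply its inverse. It is a serial NumPy toy, not the authors' GPU solver, and the test matrix and function names are assumptions. The solver in the paper works on MPC-structured systems, and nothing in the abstract specifies the form of its preconditioner beyond "tridiagonal".

```python
import numpy as np

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system (no pivoting)."""
    n = len(diag)
    c = np.zeros(n - 1)
    d = np.zeros(n)
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i - 1] * c[i - 1]
        if i < n - 1:
            c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def pcg(A, b, tol=1e-10, max_iter=200):
    """Preconditioned CG; the preconditioner is the tridiagonal part of A."""
    lower, diag, upper = np.diag(A, -1), np.diag(A), np.diag(A, 1)
    x = np.zeros_like(b)
    r = b - A @ x
    z = solve_tridiagonal(lower, diag, upper, r)
    p = z.copy()
    rz = r @ z
    for k in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        z = solve_tridiagonal(lower, diag, upper, r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Toy SPD matrix: a dominant tridiagonal band plus weak long-range coupling.
n = 50
idx = np.arange(n)
gap = np.abs(idx[:, None] - idx[None, :])
A = np.where(gap == 0, 4.0, np.where(gap == 1, -1.0, 0.01 / (1.0 + gap)))
b = np.ones(n)

x, iterations = pcg(A, b)
```

The matrix-vector products and vector updates inside the loop are the parts that parallelize naturally. The sequential Thomas solve used here is a CPU convenience and would need a parallel replacement on a GPU.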
ManipEvalAgent: Promptable and Efficient Evaluation Framework for Robotic Manipulation Policies
Yiteng Chen ⋅ Huiping Zhuang ⋅ Wenbo Li ⋅ Shiyi Wang ⋅ Xiangyu Zhao ⋅ Qingyao Wu
In recent years, robotic manipulation policies have made substantial progress. However, evaluating these policies typically requires large-scale sampling in simulation benchmarks, leading to high time costs. Moreover, existing evaluation pipelines are usually fixed, do not account for user needs, and report only a single scalar score, lacking interpretability. In contrast, human experts can quickly form an intuitive impression of a policy’s capabilities from just a handful of executions. We therefore propose ManipEvalAgent, an efficient, promptable, and dynamically multi-round evaluation framework for robotic manipulation policies. The framework conducts small-batch, multi-round evaluations and adaptively plans subsequent evaluation steps based on intermediate observations from each round. Via code generation, it constructs tasks and evaluation functions within the simulator. By generating evaluation functions and leveraging vision–language models (VLMs) for video understanding, ManipEvalAgent provides user-instruction-centric, fine-grained analysis. Our approach offers three key advantages: (1) efficiency, with no need for massive sampling; (2) promptability, planning the evaluation process according to user queries; and (3) interpretability, providing diagnostic text that goes beyond a single score. Across multiple settings, our evaluation method significantly shortens the overall time compared with traditional simulation benchmarks, while reaching conclusions comparable to those from large-scale simulation benchmarks.
VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
Yixiao Wang ⋅ Mingxiao Huo ⋅ Zhixuan Liang ⋅ Yushi Du ⋅ Lingfeng Sun ⋅ Haotian Lin ⋅ Jinghuan Shang ⋅ Chensheng Peng ⋅ Mohit Bansal ⋅ Mingyu Ding ⋅ Masayoshi Tomizuka
Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting generality across tasks. Distilling multiple VFMs into a unified representation can mitigate this limitation but often yields inflexible task-specific feature selection and requires costly full retraining to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. We then fine-tune only a lightweight routing network (fewer than 0.4% of parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both flexibility and precision of dynamic expert selection. Moreover, VER supports parameter-efficient finetuning for scalable expert utilization and robot-domain knowledge integration. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. More visualizations and code are available at https://yixiaowang7.github.io/ver_page/.
Abstracting Robot Manipulation Skills via Mixture-of-Experts Diffusion Policies
Ce Hao ⋅ Xuanran Zhai ⋅ Yaohua Liu ⋅ Harold Soh
Diffusion-based policies have recently shown strong results in robot manipulation, but their extension to multi-task scenarios is hindered by the high cost of scaling model size and demonstrations. We introduce Skill Mixture-of-Experts Policy (SMP), a diffusion-based mixture-of-experts policy that learns a compact orthogonal skill basis and uses sticky routing to compose actions from a small, task-relevant subset of experts at each step. A variational training objective supports this design, and adaptive expert activation at inference yields fast sampling without oversized backbones. We validate SMP in simulation and on a real dual-arm platform with multi-task learning and transfer learning tasks, where SMP achieves higher success rates and markedly lower inference cost than large diffusion baselines. These results indicate a practical path toward scalable, transferable multi-task manipulation: learn reusable skills once, activate only what is needed, and adapt quickly when tasks change.
Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance
Yang Zhang ⋅ Chenwei Wang ⋅ ouyang lu ⋅ Yuan Zhao ⋅ Yunfei Ge ⋅ Zhenglong Sun ⋅ Xiu Li ⋅ Chi Zhang ⋅ Chenjia Bai ⋅ Xuelong Li
Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks. Our code is released at https://github.com/TeleHuman/Align-Then-Steer.
RoboOmni: Proactive Robot Manipulation in Omni-modal Context
Siyin Wang ⋅ Jinlan Fu ⋅ Feihong Liu ⋅ Xinzhe He ⋅ Huangxuan Wu ⋅ Junhao Shi ⋅ Kexin Huang ⋅ Zhaoye Fei ⋅ Jingjing Gong ⋅ Zuxuan Wu ⋅ Yu-Gang Jiang ⋅ See-Kiong Ng ⋅ Tat-Seng Chua ⋅ Xipeng Qiu
Recent advances in Multimodal Large Language Models (MLLMs) have driven rapid progress in Vision–Language–Action (VLA) models for robotic manipulation. Although effective in many scenarios, current approaches largely rely on explicit instructions, whereas in real-world interactions, humans rarely issue instructions directly. Effective collaboration requires robots to infer user intentions proactively. In this work, we introduce cross-modal contextual instructions, a new setting where intent is derived from spoken dialogue, environmental sounds, and visual cues rather than explicit commands. To address this new setting, we present RoboOmni, a Perceiver-Thinker-Talker-Executor framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. RoboOmni fuses auditory and visual signals spatiotemporally for robust intention recognition, while supporting direct speech interaction. To address the absence of training data for proactive intention recognition in robotic manipulation, we build OmniAction, comprising 140k episodes, 5k+ speakers, 2.4k event sounds, 640 backgrounds, and six contextual instruction types. Experiments in simulation and real-world settings show that RoboOmni surpasses text- and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance. All datasets, code, and real-world demonstration videos will be released publicly.
Adapt Data to Model: Adaptive Transformation Optimization for Domain-shared Time Series Foundation Models
Yunzhong Qiu ⋅ Zhiyao Cen ⋅ Zhongyi Pei ⋅ Chen Wang ⋅ Jianmin Wang
Large time series models (LTMs) have emerged as powerful tools for universal forecasting, yet they often struggle with the inherent diversity and nonstationarity of real-world time series data, leading to an unsatisfactory trade-off between forecasting accuracy and generalization. Rather than continually finetuning new LTM instances for each domain, we propose a data-centric framework, time-series adaptive transformation optimization (TATO), that enables a single frozen pre-trained LTM to adapt to diverse downstream domains through an optimally configured transformation pipeline. Specifically, TATO constructs three representative types of transformations, including context slicing, scale normalization, and outlier correction, to help LTMs better align with target domain characteristics. To ensure robustness, we incorporate carefully selected time series augmentations and a two-stage ranking mechanism that filters out pipelines underperforming on specific metrics. Extensive experiments on state-of-the-art LTMs and widely used datasets demonstrate that TATO consistently and significantly improves domain-adaptive forecasting performance, achieving a maximum reduction in MSE of 65.4% and an average reduction of 13.6%. Moreover, TATO is highly efficient, typically completing optimization in under 2 minutes, making it practical for real-world deployment. The source code is available at https://github.com/thulab/TATO.
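The idea of adapting the data rather than the model can be sketched as a small search over transformation settings applied in front of a frozen forecaster. Everything below is an assumption for illustration: `frozen_model` is a trivial stand-in (it predicts the context mean), the MAD-based clipping and the grid search are generic choices, and the augmentations and two-stage ranking described above are omitted. It is not the TATO code.

```python
import itertools
import numpy as np

def frozen_model(context, horizon):
    # Stand-in for a frozen pre-trained forecaster: repeats the context mean.
    return np.full(horizon, context.mean())

def apply_pipeline(context, slice_len, clip_k):
    x = context[-slice_len:]                                  # context slicing
    med = np.median(x)
    mad = np.median(np.abs(x - med)) + 1e-8
    x = np.clip(x, med - clip_k * mad, med + clip_k * mad)    # outlier correction
    mu, sd = x.mean(), x.std() + 1e-8                         # scale normalization
    return (x - mu) / sd, mu, sd

def forecast(context, horizon, slice_len, clip_k):
    z, mu, sd = apply_pipeline(context, slice_len, clip_k)
    return frozen_model(z, horizon) * sd + mu                 # undo normalization

def select_pipeline(context, target, slice_lens, clip_ks):
    # Exhaustive search over (slice length, clip factor); returns (mse, L, k).
    best = None
    for L, k in itertools.product(slice_lens, clip_ks):
        pred = forecast(context, len(target), L, k)
        mse = float(np.mean((pred - target) ** 2))
        if best is None or mse < best[0]:
            best = (mse, L, k)
    return best

# Hypothetical series: a level shift at t=40 plus one spike at t=50.
t = np.arange(60)
context = np.concatenate([np.zeros(40), np.full(20, 10.0)]) + 0.5 * np.sin(t)
context[50] = 100.0
target = np.full(5, 10.0)

best = select_pipeline(context, target, slice_lens=[16, 60], clip_ks=[3.0, np.inf])
print(best)
```

On this toy series the short, clipped context wins because it drops the pre-shift history and suppresses the spike, while the forecaster itself is never updated.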
Uncertainty-Aware Gaussian Map for Vision-Language Navigation
Jianzhe Gao ⋅ Rui Liu ⋅ Yuxuan Xu ⋅ Tongtong Cao ⋅ Yingxue Zhang ⋅ Zhanguang Zhang ⋅ Sida Peng ⋅ Yi Yang ⋅ Wenguan Wang
Vision-Language Navigation (VLN) requires an agent to navigate 3D environments following natural language instructions. During navigation, existing agents commonly encounter perceptual uncertainty, such as insufficient evidence for reliable grounding or ambiguity in interpreting spatial cues, yet they typically ignore such information when predicting actions. In this work, we explicitly model three forms of perceptual uncertainty (i.e., geometric, semantic, and appearance uncertainty) and integrate them into the agent’s observation space to enable informed decision-making. Concretely, our agent first constructs a Semantic Gaussian Map (SGM), composed of differentiable 3D Gaussian primitives initialized from panoramic observations, that encodes both the geometric structure and semantic content of the environment. On top of SGM, geometric uncertainty is estimated through variational perturbations of Gaussian position and scale to assess structural reliability; semantic uncertainty is captured by perturbing Gaussian semantic attributes to reveal ambiguous interpretations; and appearance uncertainty is characterized by Fisher Information, which measures the sensitivity of rendered observations to Gaussian-level variations. These uncertainties are incorporated into SGM, extending it into a unified 3D Value Map, which grounds them as affordances and constraints that support reliable navigation. Comprehensive evaluations across multiple VLN benchmarks show the effectiveness of our agent.
Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation
Meng Wei ⋅ Chenyang Wan ⋅ Peng ⋅ Xiqian Yu ⋅ Yuqiang Yang ⋅ Delin Feng ⋅ Wenzhe Cai ⋅ Chenming Zhu ⋅ Tai Wang ⋅ Jiangmiao Pang ⋅ Xihui Liu
While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
Autonomous Functional Play with Correspondence-Driven Trajectory Warping
William Liang ⋅ Sam Wang ⋅ Hung-Ju Wang ⋅ Osbert Bastani ⋅ Yecheng Jason Ma ⋅ Dinesh Jayaraman
The ability to conduct and learn from interaction and experience is a central challenge in robotics, offering a scalable alternative to labor-intensive human demonstrations. However, realizing such "play" requires (1) a policy robust to diverse, potentially out-of-distribution environment states, and (2) a procedure that continuously produces useful robot experience. To address these challenges, we introduce Tether, a method for autonomous functional play involving structured, task-directed interactions. First, we design a novel open-loop policy that warps actions from a small set of source demonstrations (≤10) by anchoring them to semantic keypoint correspondences in the target scene. We show that this design is extremely data-efficient and robust even under significant spatial and semantic variations. Second, we deploy this policy for autonomous functional play in the real world via a continuous cycle of task selection, execution, evaluation, and improvement, guided by the visual understanding capabilities of vision-language models. This procedure generates diverse, high-quality datasets with minimal human intervention. In a household-like multi-object setup, our method is the first to perform many hours of autonomous multi-task play in the real world starting from only a handful of demonstrations. This produces a stream of data that consistently improves the performance of closed-loop imitation policies over time, ultimately yielding over 1000 expert-level trajectories and training policies competitive with those learned from human-collected demonstrations.
$AutoDrive\text{-}P^3$: Unified Chain of Perception–Prediction–Planning Thought via Reinforcement Fine-Tuning
Yuqi Ye ⋅ Zijian Zhang ⋅ Junhong Lin ⋅ Shangkun Sun ⋅ Changhao Peng ⋅ Wei Gao
Vision-language models (VLMs) are increasingly being adopted for end-to-end autonomous driving systems due to their exceptional performance in handling long-tail scenarios. However, current VLM-based approaches suffer from two major limitations: 1) Some VLMs directly output planning results without chain-of-thought (CoT) reasoning, bypassing crucial perception and prediction stages, which creates a significant domain gap and compromises decision-making capability; 2) Other VLMs can generate outputs for perception, prediction, and planning tasks but employ a fragmented decision-making approach where these modules operate separately, leading to a significant lack of synergy that undermines true planning performance. To address these limitations, we propose ${AutoDrive\text{-}P^3}$, a novel framework that seamlessly integrates $\underline{\textbf{P}}$erception, $\underline{\textbf{P}}$rediction, and $\underline{\textbf{P}}$lanning through structured reasoning. We introduce the ${P^3\text{-}CoT}$ dataset to facilitate coherent reasoning and propose ${P^3\text{-}GRPO}$, a hierarchical reinforcement learning algorithm that provides progressive supervision across all three tasks. Specifically, ${AutoDrive\text{-}P^3}$ progressively generates CoT reasoning and answers for perception, prediction, and planning, where perception provides essential information for subsequent prediction and planning, while both perception and prediction collectively contribute to the final planning decisions, enabling safer and more interpretable autonomous driving. Additionally, to balance inference efficiency with performance, we introduce dual thinking modes: detailed thinking and fast thinking. Extensive experiments on both open-loop (nuScenes) and closed-loop (NAVSIMv1/v2) benchmarks demonstrate that our approach achieves state-of-the-art performance in planning tasks. Code is available at https://github.com/haha-yuki-haha/AutoDrive-P3, https://openi.pcl.ac.cn/OpenAIDriving/AutoDrive-P3.
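For readers unfamiliar with GRPO-style training, the core step is to score a group of sampled responses and normalize each reward against the group mean and standard deviation. The sketch below shows only that generic step. The abstract does not specify how the hierarchical variant combines perception, prediction, and planning rewards, so `combine_stage_rewards` and its weights are purely hypothetical.

```python
import statistics

def combine_stage_rewards(perception, prediction, planning, weights=(0.2, 0.3, 0.5)):
    # Hypothetical weighted sum of per-stage rewards (weights are an assumption).
    w_per, w_pred, w_plan = weights
    return w_per * perception + w_pred * prediction + w_plan * planning

def group_relative_advantages(rewards, eps=1e-8):
    # Generic GRPO-style normalization: (reward - group mean) / group std.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts for one driving scene:
# (perception reward, prediction reward, planning reward).
rollouts = [(1.0, 1.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
rewards = [combine_stage_rewards(*r) for r in rollouts]
advantages = group_relative_advantages(rewards)

for reward, adv in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={adv:+.3f}")
```

Rollouts that do better than their group receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.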
Capturing Visual Environment Structure Correlates with Control Performance
Jiahua Dong ⋅ Yunze Man ⋅ Pavel Tokmakov ⋅ Yu-Xiong Wang
The choice of visual representation is key to scaling generalist robot policies. However, direct evaluation via policy rollouts is expensive, even in simulation. Existing proxy metrics focus on the representation's capacity to capture narrow aspects of the visual world, like object shape, limiting generalization across environments. In this paper, we take an analytical perspective: we probe pretrained visual encoders by measuring how well they support decoding of environment state—including geometry, object structure, and physical attributes—from images. Leveraging simulation environments with access to ground-truth state, we show that this probing accuracy strongly correlates with downstream policy performance across diverse environments and learning settings, significantly outperforming prior metrics. Our study provides insight into the representational properties that support generalizable manipulation, suggesting that learning to encode full environment state is a promising objective for visual representations for control.
COOPERTRIM: Adaptive Data Selection for Uncertainty-Aware Cooperative Perception
Shilpa Mukhopadhyay ⋅ Amit Roy-Chowdhury ⋅ Hang Qiu
Cooperative perception enables autonomous agents to share encoded representations over wireless communication to enhance each other’s live situational awareness. However, the tension between the limited communication bandwidth and the rich sensor information hinders its practical deployment. Recent studies have explored selection strategies that share only a subset of features per frame while striving to keep the performance on par. Nevertheless, the bandwidth requirement still stresses current wireless technologies. To fundamentally ease the tension, we take a proactive approach, exploiting the temporal continuity to identify features that capture environment dynamics, while avoiding repetitive and redundant transmission of static information. By incorporating temporal awareness, agents are empowered to dynamically adapt the sharing quantity according to environment complexity. We instantiate this intuition into an adaptive selection framework, COOPERTRIM, which introduces a novel conformal temporal uncertainty metric to gauge feature relevance, and a data-driven mechanism to dynamically determine the sharing quantity. To evaluate COOPERTRIM, we take semantic segmentation and 3D detection as example tasks. Across multiple open-source cooperative segmentation and detection models, COOPERTRIM achieves up to 80.28% and 72.52% bandwidth reduction respectively while maintaining a comparable accuracy. Relative to other selection strategies, COOPERTRIM also improves IoU by as much as 45.54% with up to 72% less bandwidth. Combined with compression strategies, COOPERTRIM can further reduce bandwidth usage to as low as 1.46% without compromising IoU performance. Qualitative results show COOPERTRIM gracefully adapts to environmental dynamics, localization error, and communication latency, demonstrating flexibility and paving the way for real-world deployment.
Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning
Muleilan Pei ⋅ Shaoshuai Shi ⋅ Shaojie Shen
Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.
Repurposing Foundation Model for Generalizable Medical Time Series Classification
Nan Huang ⋅ Haishuai Wang ⋅ Zihuai He ⋅ Marinka Zitnik ⋅ Xiang Zhang
Medical time series (MedTS) classification suffers from poor generalizability in real-world deployment due to inter- and intra-dataset heterogeneity, such as varying numbers of channels, signal lengths, task definitions, and patient characteristics. To address this, we propose FORMED, a novel framework for repurposing a backbone foundation model, pre-trained on generic time series, to enable highly generalizable MedTS classification on unseen datasets. FORMED combines the backbone with a novel classifier comprising two components: (1) task-specific channel embeddings and label queries, dynamically sized to match any number of channels and target classes, and (2) a shared decoding attention layer, jointly trained across datasets to capture medical domain knowledge through task-agnostic feature-query interactions. After repurposing, FORMED achieves seamless adaptation to unseen MedTS datasets through lightweight label query training (0.1% of parameters), eliminating the need for full fine-tuning or architectural redesign. We evaluate FORMED on 5 diverse MedTS datasets, benchmarking against 11 Task-Specific Models (TSM) and 4 Task-Specific Adaptation (TSA) methods. Our results demonstrate FORMED's dominant performance, achieving up to 35% absolute improvement in F1-score (on ADFTD dataset) over specialized baselines. By decoupling domain-invariant representation learning from task-specific adaptation, FORMED establishes a scalable and resource-efficient paradigm for foundation model repurposing in healthcare. This approach prioritizes clinical adaptability over rigid task-centric design, offering a practical pathway for real-world implementation.
Complexity- and Statistics-Guided Anomaly Detection in Time Series Foundation Models
Jongwon Kim ⋅ Samuel Yoon ⋅ Young Ko ⋅ Yerin Kim ⋅ Sung Kim ⋅ JAEUNG TAE
This paper introduces a methodology for anomaly detection in time series using Time Series Foundation Models (TFMs). While TFMs have achieved strong success in forecasting, their role in anomaly detection remains underexplored. We identify two key challenges when applying TFMs to reconstruction-based anomaly detection and propose solutions. The first challenge is overgeneralization, where TFMs reconstruct both normal and abnormal data with similar accuracy, masking true anomalies. We find that this effect often occurs in data with strong low-frequency components. To address it, we propose a complexity metric, $\alpha$, that reflects how difficult the data is for TFMs and design a Complexity-Aware Ensemble (CAE) that adaptively balances TFMs with a statistical model. The second challenge is overstationarization, caused by instance normalization layers that improve forecasting accuracy but remove essential statistical features such as mean and variance, which are critical for anomaly detection. We resolve this by reintroducing these features into the reconstruction process without retraining the TFMs. Experiments on 23 univariate and 17 multivariate benchmark datasets demonstrate that our method significantly outperforms both deep learning and statistical baselines. Furthermore, we show that our complexity-based metric, $\alpha$, provides a theoretical foundation for improved anomaly detection, and we briefly explore prediction-based anomaly detection using TFMs.
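The overstationarization effect can be seen in a few lines: per-window instance normalization maps a window with an abnormal level and scale onto exactly the same normalized shape as a normal window, so only the discarded mean and standard deviation still tell them apart. This sketch only demonstrates the information loss; how the statistics are reintroduced into reconstruction is not specified in the abstract, and the signal and names here are illustrative assumptions.

```python
import numpy as np

def instance_normalize(window, eps=1e-8):
    # Per-window normalization; also returns the statistics it removes.
    mu, sd = window.mean(), window.std()
    return (window - mu) / (sd + eps), mu, sd

t = np.arange(64)
normal = np.sin(2 * np.pi * t / 16)
shifted = 3.0 * normal + 10.0      # same shape, abnormal level and scale

z_normal, mu_n, sd_n = instance_normalize(normal)
z_shifted, mu_s, sd_s = instance_normalize(shifted)

# After normalization the two windows are indistinguishable...
same_after_norm = bool(np.allclose(z_normal, z_shifted, atol=1e-6))
# ...but the removed statistics still separate them.
mean_gap = float(abs(mu_s - mu_n))
scale_ratio = float(sd_s / sd_n)

print(same_after_norm, round(mean_gap, 3), round(scale_ratio, 3))
```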
Tackling Time-Series Forecasting Generalization via Mitigating Concept Drift
Zhiyuan Zhao ⋅ Haoxin Liu ⋅ B. Aditya Prakash
Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is important for time-series forecasting models to handle potential distribution shifts over time. In this paper, we first identify two types of distribution shifts in time series: concept drift and temporal shift. While existing studies primarily focus on addressing temporal shift in time series forecasting, designing proper concept drift methods for time series forecasting has received comparatively less attention. Because conventional concept drift methods based on invariant learning face certain challenges in time-series forecasting, we propose a soft attention mechanism that finds invariant patterns from both lookback and horizon time series. Additionally, we emphasize the critical importance of mitigating temporal shifts as a preliminary to addressing concept drift. In this context, we introduce ShifTS, a method-agnostic framework designed to tackle temporal shift first and then concept drift within a unified approach. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of the models it is applied to across multiple datasets, and in outperforming existing concept drift, temporal shift, and combined baselines.
SwiftTS: A Swift Selection Framework for Time Series Pre-trained Models via Multi-task Meta-Learning
Tengxue Zhang ⋅ Biao Ouyang ⋅ Yang Shu ⋅ Xinyang Chen ⋅ Guo ⋅ Bin Yang
Pre-trained models exhibit strong generalization to various downstream tasks. However, given the numerous models available in the model hub, identifying the most suitable one by individually fine-tuning is time-consuming. In this paper, we propose SwiftTS, a swift selection framework for time series pre-trained models. To avoid expensive forward propagation through all candidates, SwiftTS adopts a learning-guided approach that leverages historical dataset-model performance pairs across diverse horizons to predict model performance on unseen datasets. It employs a lightweight dual-encoder architecture that embeds time series and candidate models with rich characteristics, computing patchwise compatibility scores between data and model embeddings for efficient selection. To further enhance the generalization across datasets and horizons, we introduce a horizon-adaptive expert composition module that dynamically adjusts expert weights, and the transferable cross-task learning with cross-dataset and cross-horizon task sampling to enhance out-of-distribution (OOD) robustness. Extensive experiments on 14 downstream datasets and 8 pre-trained models demonstrate that SwiftTS achieves state-of-the-art performance in time series pre-trained model selection. The code and datasets are available at https://github.com/decisionintelligence/SwiftTS.
ARROW: An Adaptive Rollout and Routing Method for Global Weather Forecasting
Jindong Tian ⋅ Yifei Ding ⋅ Ronghui Xu ⋅ Hao Miao ⋅ Guo ⋅ Bin Yang
Weather forecasting is a fundamental task in spatiotemporal data analysis, with broad applications across a wide range of domains. Existing data-driven forecasting methods typically model atmospheric dynamics over a fixed short time interval, e.g., 6 hours, and rely on naive autoregression-based rollout for long-term forecasting, e.g., 5 days. However, this paradigm suffers from two key limitations: (1) it often inadequately models the spatial and multi-scale temporal dependencies inherent in global weather systems, and (2) the rollout strategy struggles to balance error accumulation with the capture of fine-grained atmospheric variations. In this study, we propose ARROW, an Adaptive-Rollout Multi-scale temporal Routing method for Global Weather Forecasting. To contend with the first limitation, we construct a multi-interval forecasting model that forecasts weather across different time intervals. Within the model, the Shared-Private Mixture-of-Experts captures both shared patterns and specific characteristics of atmospheric dynamics across different time scales, while Ring Positional Encoding accurately encodes the circular latitude structure of the Earth when representing spatial information. For the second limitation, we develop an adaptive rollout scheduler based on reinforcement learning, which selects the most suitable time interval to forecast according to the current weather state. Experimental results demonstrate that ARROW achieves state-of-the-art performance in global weather forecasting, establishing a promising paradigm in this field.
GTM: A General Time-series Model for Enhanced Representation Learning of Time-Series data
Cheng HE ⋅ Xu Huang ⋅ Gangwei Jiang ⋅ Zhaoyi Li ⋅ Defu Lian ⋅ Hong Xie ⋅ Enhong Chen ⋅ xijie liang ⋅ Zhengzengrong ⋅ Patrick P. C. Lee
Despite recent progress in time-series foundation models, challenges persist in improving representation learning and adapting to diverse downstream tasks. We introduce a General Time-series Model (GTM), which advances representation learning via a novel frequency-domain attention mechanism that captures time-granularity-aware features, an aspect underexplored in prior research. We further propose a novel pre-training strategy that unifies reconstruction and autoregressive objectives through a hybrid masking mechanism. Our pre-training strategy, combined with 2D positional encoding and span shuffling, enhances the robustness and generalization of representations. GTM is established as the first generative-task-agnostic model for time-series analysis, enabling seamless adaptation to various generative tasks without any task-specific modifications. Extensive experiments demonstrate that GTM consistently outperforms SOTA models on various generative tasks and achieves strong classification results with minimal adaptation. Furthermore, GTM exhibits clear scaling behavior, with accuracy improving as model size and pre-training data increase.
Local Geometry Attention for Time Series Forecasting under Realistic Corruptions
Dongbin Kim ⋅ Youngjoo Park ⋅ Woojin Jeong ⋅ Jaewook Lee
Transformers have demonstrated strong performance in time series forecasting, yet they often fail to capture the intrinsic structure of temporal data, making them susceptible to real-world noise and anomalies. Unlike in vision or language, the local geometry of temporal patterns is a critical feature in time series forecasting, but it is frequently disrupted by corruptions. In this work, we address this gap with two key contributions. First, we propose Local Geometry Attention (LGA), a novel attention mechanism theoretically grounded in local Gaussian process theory. LGA adapts to the intrinsic data geometry by learning query-specific distance metrics, enabling it to model complex temporal dependencies and enhance resilience to noise. Second, we introduce TSRBench, the first comprehensive benchmark for evaluating forecasting robustness under realistic, statistically-grounded corruptions. Experiments on TSRBench show that LGA significantly reduces performance degradation, consistently outperforming both Transformer and linear models. These results establish a foundation for developing robust time series models that can be deployed in real-world applications where data quality is not guaranteed. Our code is available at: https://github.com/dongbeank/LGA.
Causal Score Conditioning for Multi-Resolution Latent Systems
Xuechun Li ⋅ Shan Gao ⋅ Susu Xu
Complex causal systems with interdependent variables require inference from heterogeneous observations that vary in spatial resolution, temporal frequency, and noise characteristics due to data acquisition constraints. Existing multi-modal fusion approaches assume uniform data quality or complete observability -- assumptions often violated in real-world applications. Current methods face three limitations: they treat causally-related variables independently, failing to exploit causal relationships; they cannot integrate multi-resolution observations effectively; and they lack theoretical frameworks for cascaded approximation errors. We introduce the Score-based Variational Graphical Diffusion Model (SVGDM), which integrates score-based diffusion within causal graphical structures for inference under heterogeneous incomplete observations. SVGDM introduces causal score decomposition enabling information propagation across causally-connected variables while preserving original observation characteristics. Diffusion provides a natural way to model scale-dependent sensing noise, which is common in remote-sensing, climate, and physical measurement systems, while the causal graph encodes well-established mechanistic dependencies between latent processes. We provide theoretical analysis and demonstrate superior performance on both synthetic and real-world datasets compared to relevant baselines.
Unknown unknowns (U2s) are deployment-time scenarios absent from development/testing. Unlike conventional anomalies, U2s are not out-of-distribution (OOD); they stem from changes in underlying system dynamics without a distribution shift from normal data. Thus, existing multi-variate time series anomaly detection (MTAD) methods, which rely on distribution-shift cues, are ill-suited for U2 detection. Specifically: (i) we show most anomaly datasets exhibit distribution shift between normal and anomalous data and therefore are not representative of U2s; (ii) we introduce eight U2 benchmarks where training data contain OOD anomalies but no U2s, while test sets contain both OOD anomalies and U2s; (iii) we demonstrate that state-of-the-art (SOTA) MTAD results often depend on impractical enhancements: point adjustment (PA) (uses ground truth to flip false negatives to true positives, inflating precision) and threshold learning with data leakage (TL) (tuning thresholds on test data and labels); (iv) with PA+TL, even untrained deterministic methods can match or surpass MTAD baselines; (v) without PA/TL, existing MTAD methods degrade sharply on U2 benchmarks. Finally, we present sparse model identification–enhanced anomaly detection (SPIE-AD), a model-recovery-and-conformance, zero-shot MTAD approach that outperforms baselines on all eight U2 benchmarks and on six additional real-world MTAD datasets, without PA or TL. Code and data are available at https://github.com/ImpactLabASU/U2Recognition.
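Point adjustment, as commonly described in the MTAD literature, marks every point of a ground-truth anomaly segment as detected once any single point in it is flagged. The sketch below follows that common description on a hypothetical ten-point series to show how much the scores move. It is not taken from the linked repository.

```python
def point_adjust(pred, label):
    """If any point inside a ground-truth anomaly segment is flagged, flag the whole segment."""
    adjusted = list(pred)
    n = len(label)
    i = 0
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:   # find the end of this segment
                j += 1
            if any(pred[i:j]):
                for k in range(i, j):
                    adjusted[k] = 1
            i = j
        else:
            i += 1
    return adjusted

def precision_recall_f1(pred, label):
    tp = sum(p == 1 and l == 1 for p, l in zip(pred, label))
    fp = sum(p == 1 and l == 0 for p, l in zip(pred, label))
    fn = sum(p == 0 and l == 1 for p, l in zip(pred, label))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: one 4-point anomaly, one hit inside it, one false alarm.
label = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
pred = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]

raw = precision_recall_f1(pred, label)
adjusted = precision_recall_f1(point_adjust(pred, label), label)
print("raw     :", raw)
print("adjusted:", adjusted)
```

A single lucky hit lifts precision from 0.5 to 0.8 and recall from 0.25 to 1.0 here, which is why results reported with and without PA are not comparable.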
When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency
Ren Fujiwara ⋅ Yasuko Matsubara ⋅ Yasushi Sakurai
Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER, a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter $\theta$. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error with increasing $\theta$ indicates that the data size is sufficiently informative for retraining. We also provide a theoretical analysis of our method, and we show that the algorithm has a low per-update time and memory. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.
OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models
Subrata Biswas ⋅ Mohammad Khan ⋅ Bashima Islam
Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the Spatial-Acoustic Geometry Encoder (SAGE), a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room impulse responses at training time, while requiring only audio at inference. Building on this representation, we present OWL, an ALLM that integrates SAGE with a spatially grounded chain-of-thought to reason over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, OWL supports o’clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release BiDepth, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new BiDepth and the public SpatialSoundQA, OWL reduces mean DoA error by 11° through SAGE and improves spatial reasoning QA accuracy by up to 25% over BAT. Our dataset and code are available at: https://github.com/BASHLab/OWL
Death of the Novel(ty): Beyond N-Gram Novelty as a Metric for Textual Creativity
Arkadiy Saakyan ⋅ Najoung Kim ⋅ Smaranda Muresan ⋅ Tuhin Chakrabarty
$N$-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity's dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and $n$-gram novelty through 8,618 expert writer annotations of novelty, pragmaticality, and sensicality via close reading of human- and AI-generated text. We find that while $n$-gram novelty is positively associated with expert writer-judged creativity, approximately 91% of top-quartile $n$-gram novel expressions are not judged as creative, cautioning against relying on $n$-gram novelty alone. Furthermore, unlike in human-written text, higher $n$-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier closed-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify expressions perceived as novel by experts (a positive aspect of writing) or non-pragmatic (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty ratings align with expert writer preferences in an out-of-distribution dataset, more so than an $n$-gram-based metric.
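One common way to operationalize $n$-gram novelty is the fraction of a text's $n$-grams that never occur in a reference corpus. The sketch below uses that definition with whitespace tokenization. Both choices are assumptions, and the paper's exact metric and corpus may differ.

```python
def ngrams(tokens, n):
    # All contiguous n-token windows, as hashable tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_novelty(text, reference_corpus, n=3):
    """Fraction of n-grams in `text` that do not appear in any reference document."""
    seen = set()
    for doc in reference_corpus:
        seen.update(ngrams(doc.lower().split(), n))
    candidate = ngrams(text.lower().split(), n)
    if not candidate:
        return 0.0
    return sum(g not in seen for g in candidate) / len(candidate)

# Hypothetical toy corpus and candidate sentence.
corpus = ["the cat sat on the mat"]
score = ngram_novelty("the cat sat on the moon", corpus, n=3)
print(score)  # 0.25: only "on the moon" is unseen among the four trigrams
```

The score says nothing about whether the unseen trigram is sensible or pragmatic, which is the gap between novelty and creativity that the abstract points to.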
Flipping the Dialogue: Training and Evaluating User Language Models
Tarek Naous ⋅ Philippe Laban ⋅ Wei Xu ⋅ Jennifer Neville
Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants -- optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often by prompting an LM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) - models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.
RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao ⋅ Jaylen Jones ⋅ Linxi Jiang ⋅ Yuting Ning ⋅ Eric Fosler-Lussier ⋅ Yu Su ⋅ ZHIQIANG LIN ⋅ Huan Sun
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection, where attackers embed malicious content into the environment to hijack agent behavior. Current evaluations of this threat either lack support for adversarial testing in realistic but controlled environments or ignore hybrid web-OS attack scenarios involving both interfaces. To address this, we propose RedTeamCUA, an adversarial testing framework featuring a novel hybrid sandbox that integrates a VM-based OS environment with Docker-based web platforms. Our sandbox supports key features tailored for red teaming, such as flexible adversarial scenario configuration, and a setting that decouples adversarial evaluation from navigational limitations of CUAs by initializing tests directly at the point of an adversarial injection. Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities. Benchmarking current frontier CUAs identifies significant vulnerabilities: Claude 3.7 Sonnet | CUA demonstrates an Attack Success Rate (ASR) of 42.9%, while Operator, the most secure CUA evaluated, still exhibits an ASR of 7.6%. Notably, CUAs often attempt to execute adversarial tasks with an Attempt Rate as high as 92.5%, although failing to complete them due to capability limitations. Nevertheless, we observe concerning ASRs of up to 50% in realistic end-to-end settings, indicating that CUA threats can already result in tangible risks to users and computer systems. Overall, RedTeamCUA provides an essential framework for advancing realistic, controlled, and systematic analysis of CUA vulnerabilities, highlighting the urgent need for robust defenses to indirect prompt injection prior to real-world deployment.
Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models
David Bani-Harouni ⋅ Chantal Pellegrini ⋅ Paul Stangel ⋅ Ege Özsoy ⋅ Kamilia Zaripova ⋅ Nassir Navab ⋅ Matthias Keicher
Safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that directly fine-tunes LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence. This encourages the model to align its confidence estimates with the actual predictive accuracy. The optimal policy under our reward design would result in perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of general confidence awareness. Our code is available at https://github.com/pasta99/RewardingDoubt.
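The claim that a logarithmic scoring rule rewards calibration can be checked numerically. The sketch below assumes the plain form of the rule (log of the stated confidence if the answer is correct, log of one minus the confidence otherwise); the clipping constant `eps` and the function names are illustrative assumptions, and the paper's exact reward may differ in scaling or clipping.

```python
import math


def log_score_reward(confidence, is_correct, eps=1e-6):
    """Logarithmic scoring rule: log(p) if correct, log(1 - p) if wrong.

    Assumption: confidence is clipped to [eps, 1 - eps] to keep the reward finite.
    """
    p = min(max(confidence, eps), 1.0 - eps)
    return math.log(p) if is_correct else math.log(1.0 - p)


def expected_reward(confidence, accuracy):
    """Expected reward when the answer is correct with probability `accuracy`."""
    return (accuracy * log_score_reward(confidence, True)
            + (1.0 - accuracy) * log_score_reward(confidence, False))


# Search a grid of stated confidences for a model that is right 70% of the time.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda p: expected_reward(p, accuracy=0.7))
print(best)  # 0.7: the expected reward peaks at the true accuracy
```

Both over-confidence (for example 0.95) and under-confidence (for example 0.4) score lower in expectation than 0.7 here, which is the sense in which the rule penalizes miscalibration in either direction.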
GPS: Graph-guided Proactive Information Seeking in Large Language Models
Ruiqing Li ⋅ Yifeng Xu ⋅ Xinke Jiang ⋅ Zhibang Yang ⋅ Xinyu Ma ⋅ Yue Fang ⋅ Junfeng Zhao ⋅ Yasha Wang ⋅ Xu Chu
Equipping Large Language Models (LLMs) with the ability to proactively ask clarifying questions is essential to mitigate ambiguity when faced with underspecified user queries in retrieval-augmented generation (RAG) systems. However, existing methods often neglect the rule-based reasoning structures embedded in the retrieved knowledge that are central to ambiguity, making it challenging to learn an effective and efficient question-asking strategy. To address these issues, we introduce GPS, a two-stage framework for enhancing the proactive information-seeking abilities of LLMs in RAG systems. In the reasoning stage, we propose a Directed Acyclic Graph (DAG) reasoning structure with theoretical guarantees of logical completeness, which facilitates capturing all conditional logic in the retrieved knowledge and supports effective clarification. In the clarification stage, we design a traversal-based algorithm that dynamically prunes the DAG based on user responses, enabling efficient clarification. To further enhance DAG construction, we first propose a conditional-path-guided data synthesis method to address the data scarcity challenge, and then apply a clarification-oriented reinforcement learning method with a hybrid reward that jointly considers effectiveness and efficiency to optimize the LLM. Experiments on three benchmarks demonstrate that GPS outperforms baseline methods in both success rate and clarification efficiency.
Learnable Fractional Superlets with a Spectro-Temporal Emotion Encoder for Speech Emotion Recognition
Alaa Nfissi ⋅ Wassim Bouachir ⋅ Nizar Bouguila ⋅ Brian Mishara
Speech emotion recognition (SER) hinges on front-ends that expose informative time-frequency (TF) structure from raw speech. Classical short-time Fourier and wavelet transforms impose fixed resolution trade-offs, while prior "superlet" variants rely on integer orders and hand-tuned hyperparameters. We revisit TF analysis from first principles and formulate a learnable continuum of superlet transforms. Starting from DC-corrected analytic Morlet wavelets, we define superlets as multiplicative ensembles of wavelet responses and realize learnable fractional orders via softmax-normalized weights over discrete orders, computed as a log-domain geometric mean. We establish admissibility (zero mean) and continuity in order and frequency, and characterize approximate analyticity by bounding negative-frequency leakage as a function of an effective cycle parameter. Building on these results, we introduce the Learnable Fractional Superlet Transform (LFST), a fully differentiable front-end that jointly optimizes (i) a monotone, log-spaced frequency grid, (ii) frequency-dependent base cycles, and (iii) learnable fractional-order weights, all trained end-to-end. LFST further includes a learnable asymmetric hard-thresholding (LAHT) module that promotes sparse, denoised TF activations while preserving transients; we provide sufficient conditions for boundedness and stability under mild cycle and grid constraints. To exploit LFST for SER, we design a compact Spectro-Temporal Emotion Encoder (STEE), achieving strong performance with a parameter budget that is orders of magnitude smaller than large self-supervised models, at the cost of additional front-end computation compared to STFT- or LEAF-based baselines. STEE consumes two-channel TF maps, magnitude S and phase-congruency κ, through a compact multi-scale stack with residual temporal and depthwise-frequency blocks, Adaptive FiLM gating, axial (time-axis) self-attention, global attentive pooling, and a lightweight classifier.
The full LFST+STEE system is trained in a standard train-validate-test regime using focal loss with optional class rebalancing, and is validated on IEMOCAP, EMO-DB, and the private NSPL-CRISE dataset under standard protocols. By unifying a principled, learnable TF transform with a compact encoder, LFST+STEE replaces ad hoc front-ends with a mathematically grounded alternative that is differentiable, stable, and adaptable to data, enabling systematic ablations over frequency grids, cycle schedules, and fractional orders within a single end-to-end model. The source code of this paper is shared on the GitHub repository: https://github.com/alaaNfissi/LFST-for-SER.
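The phrase "softmax-normalized weights over discrete orders, computed as a log-domain geometric mean" can be illustrated at a single time-frequency bin. The sketch below is a generic weighted geometric mean evaluated in the log domain, assuming the per-order wavelet magnitudes are already available; the function names, the `eps` stabilizer, and the toy values are assumptions, not the LFST code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of real-valued logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_geometric_mean(magnitudes, logits, eps=1e-12):
    """Combine per-order magnitudes as exp(sum_k w_k * log(m_k)).

    `logits` play the role of learnable parameters; softmax turns them into
    weights that sum to 1, so the result interpolates between discrete orders.
    """
    weights = softmax(logits)
    log_mean = sum(w * math.log(m + eps) for w, m in zip(weights, magnitudes))
    return math.exp(log_mean)


# Hypothetical wavelet magnitudes at one TF bin for three discrete orders.
mags = [1.0, 4.0, 16.0]
uniform = weighted_geometric_mean(mags, [0.0, 0.0, 0.0])   # ~4.0, the plain geometric mean
peaked = weighted_geometric_mean(mags, [0.0, 0.0, 50.0])   # ~16.0, dominated by the last order
print(uniform, peaked)
```

Working in the log domain turns the product of responses into a weighted sum, which avoids underflow when many small magnitudes are multiplied and keeps the expression differentiable in the logits.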
IterResearch: Rethinking Long-Horizon Agents with Interaction Scaling
Guoxin Chen ⋅ Zile Qiao ⋅ Xuanzhong Chen ⋅ Donglei Yu ⋅ Haotian Xu ⋅ Xin Zhao ⋅ Ruihua Song ⋅ Wenbiao Yin ⋅ Huifeng Yin ⋅ Liwen Zhang ⋅ Kuan Li ⋅ Minpeng Liao ⋅ Yong Jiang ⋅ Pengjun Xie ⋅ Fei Huang ⋅ Jingren Zhou
Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that revisits long-horizon research through the lens of Interaction Scaling. Instead of relying on linear context accumulation, we adopt an MDP-inspired architecture with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. To effectively train this paradigm, we employ Efficiency-Aware Policy Optimization (EAPO), a training strategy that adapts geometric reward discounting to incentivize efficient exploration and utilizes adaptive downsampling for stable distributed training. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents with average +14.5pp across six benchmarks and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
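The abstract does not spell out how EAPO discounts rewards. As a rough illustration of geometric discounting only, the sketch below assumes a single terminal reward that is propagated back to earlier rounds with a factor `gamma` per remaining round; the function name, the default `gamma`, and this credit-assignment scheme are all assumptions.

```python
def discounted_round_rewards(final_reward, num_rounds, gamma=0.995):
    """Assign each round t the terminal reward discounted by gamma^(rounds remaining).

    Assumption: only the final answer is scored, and earlier rounds receive
    geometrically smaller credit the further they are from the end.
    """
    return [final_reward * gamma ** (num_rounds - 1 - t) for t in range(num_rounds)]


# With an exaggerated gamma of 0.5, a 3-round trajectory gets:
rewards = discounted_round_rewards(1.0, 3, gamma=0.5)
print(rewards)  # [0.25, 0.5, 1.0]

# A longer trajectory that reaches the same answer earns less for its first round.
short_first = discounted_round_rewards(1.0, 3, gamma=0.9)[0]
long_first = discounted_round_rewards(1.0, 10, gamma=0.9)[0]
print(short_first > long_first)  # True
```

Under this assumed scheme, trajectories that reach a correct answer in fewer rounds accumulate more reward per round, which is one way a discount can favor efficient exploration.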
Data Selection for LLM Alignment Using Fine-Grained Preferences
Jia Zhang ⋅ Yao Liu ⋅ Chen-Xi Zhang ⋅ Yi Liu ⋅ Yi-Xuan Jin ⋅ Lan-Zhe Guo ⋅ Yu-Feng Li
Large language model (LLM) alignment aims to ensure that the behavior of LLMs meets human preferences. While collecting data from multiple fine-grained, aspect-specific preferences is becoming increasingly feasible, existing alignment methods typically work on a single preference and thus struggle with conflicts inherent in such aggregated datasets. As an early attempt, we propose a data-centric approach to align LLMs through the effective use of fine-grained preferences. Specifically, we formulate the problem as direct fine-grained preference optimization and introduce preference divergence (PD), which quantifies inter-aspect preference conflicts. Instead of directly tackling the resulting complicated optimization, we recast it as a data selection problem and propose a simple yet effective strategy, which identifies a subset of data corresponding to the most negative PD values, for efficient training. We theoretically analyze the loss-bound optimality of our selection strategy and conduct extensive empirical studies on varied settings and datasets to demonstrate that our practical selection method achieves consistent improvement over standard full-data alignment, even when using just 30% of the data. Our work suggests that LLM alignment using fine-grained preferences is highly feasible.
VeriRole: Verifiable Role-Awareness through Hint-Guided Reinforcement Learning
Zongsheng Wang ⋅ Kaili Sun ⋅ Bowen Wu ⋅ qun yu ⋅ Ying Li ⋅ Xu Chen ⋅ Baoxun Wang
Maintaining role-awareness in Role-Playing Conversational Agents (RPCAs) is a significant challenge, largely because the creative nature of role-playing makes it difficult to design verifiable reward signals for reinforcement learning (RL). To address this, we propose VeriRole, a new framework designed to enhance the role-awareness of agents through a structured, verifiable reasoning process. The core of our framework is a 'hint' mechanism, designed to first extract deterministic cues from the context before the main response generation. Building on these hints, we introduce a Verifiable Role-Awareness Reward (VRAR) to provide a verifiable signal for role-awareness. Experimental results demonstrate the effectiveness of our approach. Our Qwen2.5-32B model, optimized with VeriRole, achieves increases of 18.9% and 4.55% in average scores on the RAIDEN and CharacterEval benchmarks, respectively. These results confirm that VeriRole can effectively quantify and improve role-awareness, leading to superior persona consistency and robustness. To ensure reproducibility, all prompts are detailed in the Appendix, and the associated training data has been made publicly available.
Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents
Yuxin Liu ⋅ Mingye Zhu ⋅ Siyuan Liu ⋅ Bo Hu ⋅ Lei Zhang
The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality Systems, provide a crucial explanation for this failure: a persona's influence on behavior is not static but varies with the scenario. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) a Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) a Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.
BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
Zhiheng Xi ⋅ Xin Guo ⋅ Yang Nan ⋅ Enyu Zhou ⋅ Junrui Shen ⋅ Wenxiang Chen ⋅ Jiaqi Liu ⋅ Jixuan Huang ⋅ Xun Deng ⋅ Zhihao Zhang ⋅ Honglin Guo ⋅ Zhikai Lei ⋅ Miao Zheng ⋅ Guoteng Wang ⋅ Peng Sun ⋅ Rui Zheng ⋅ Hang Yan ⋅ Tao Gui ⋅ Qi Zhang ⋅ Xuanjing Huang
Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Applying RL in off-policy settings, where stale data from past policies are used for training, improves sample efficiency but remains challenging: policy entropy declines sharply, and optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios, including sample replay and partial rollout, BAPO achieves fast, stable, and data-efficient training. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning
Fengji Zhang ⋅ Xinyao Niu ⋅ Chengyang Ying ⋅ Guancheng Lin ⋅ Zhongkai Hao ⋅ Zhou Fan ⋅ Chengen Huang ⋅ Jacky Keung ⋅ Bei Chen ⋅ Junyang Lin
Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4$% across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2$%). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at https://github.com/zfj1998/A2Search.
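The abstract does not define $\mathrm{AnsF1}$ precisely. To show why an F1-style reward accommodates multiple valid answers, here is a minimal sketch of a set-level F1 between predicted and reference answers; the normalization, the exact-match criterion, and the function names are assumptions rather than the paper's definition.

```python
def normalize(answer):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(answer.lower().strip().split())


def answer_set_f1(predicted, references):
    """F1 between the set of predicted answers and the set of valid reference answers."""
    pred = {normalize(a) for a in predicted}
    gold = {normalize(a) for a in references}
    if not pred or not gold:
        return 0.0
    hits = len(pred & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)  # how many predicted answers are valid
    recall = hits / len(gold)     # how many valid answers were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical ambiguous question with two valid answers.
gold_answers = ["paris", "Marseille"]
one_right_one_wrong = answer_set_f1(["Paris", "Lyon"], gold_answers)
both_right = answer_set_f1(["Paris", "marseille"], gold_answers)
print(one_right_one_wrong, both_right)  # 0.5 1.0
```

Unlike single-gold exact match, this kind of score gives partial credit for finding some of the valid answers and penalizes padding the output with wrong ones.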
The Limits of Inference Scaling Through Resampling
Benedikt Stroebl ⋅ Sayash Kapoor ⋅ Arvind Narayanan
Recent research has generated hope that inference scaling, such as resampling solutions until they pass verifiers like unit tests, could allow weaker models to match stronger ones. Beyond inference, this approach also enables training reasoning models, where data is curated using rejection sampling against a verifier. However, we show that this approach is fundamentally limited when verifiers are imperfect and have a non-zero probability of producing false positives. Resampling cannot decrease this probability, so it imposes an upper bound on the accuracy of resampling-based inference scaling, regardless of compute budget. Our analysis shows that there is a strong correlation between the model’s single-sample accuracy and its false positive rate on HumanEval and MBPP, whose unit tests have limited coverage. Therefore, no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model. Empirical results show that the optimal number of sampling attempts is often fewer than 10, as the negative utility of false positives outweighs the benefits, bending inference scaling curves downward. Finally, false positives may have other undesirable qualities, like poor adherence to coding style conventions.
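The upper bound can be made concrete with a simplified model. Assume each sample is independently either correct (and passes the verifier), incorrect but passing (a false positive), or rejected, and that the first passing sample is returned. The function names and the example probabilities below are illustrative assumptions, not figures from the paper.

```python
def resampling_accuracy(p_correct, p_false_positive, k):
    """Probability that the first verifier-passing sample within k attempts is correct.

    Assumptions: i.i.d. samples, no false negatives, and a run with no passing
    sample counts as a failure.
    """
    p_pass = p_correct + p_false_positive
    if p_pass == 0:
        return 0.0
    p_any_pass = 1.0 - (1.0 - p_pass) ** k
    return p_any_pass * p_correct / p_pass


def accuracy_ceiling(p_correct, p_false_positive):
    """Limit of resampling_accuracy as k grows: the share of passing samples that are correct."""
    return p_correct / (p_correct + p_false_positive)


# Hypothetical weak model: 30% correct, 10% wrong-but-passing per sample.
for k in (1, 5, 50, 1000):
    print(k, round(resampling_accuracy(0.3, 0.1, k), 4))
ceiling = accuracy_ceiling(0.3, 0.1)
print(ceiling)  # about 0.75, however large k becomes
```

In this toy setting, a stronger model with single-sample accuracy above 0.75 cannot be matched by the weaker one at any compute budget, because extra samples only raise the chance that something passes, not the share of passing samples that are correct.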
Hierarchical Semantic-Acoustic Modeling via Semi-Discrete Residual Representations for Expressive End-to-End Speech Synthesis
Yixuan Zhou ⋅ Guoyang Zeng ⋅ Xin Liu ⋅ Xiang Li ⋅ Renjie Yu ⋅ Ziyang Wang ⋅ Runchuan Ye ⋅ Weiyue Sun ⋅ Jiancheng Gui ⋅ Kehan Li ⋅ Zhiyong Wu ⋅ Zhiyuan Liu
Generative models for speech synthesis face a fundamental trade-off: discrete tokens ensure stability but sacrifice expressivity, while continuous signals retain acoustic richness but suffer from error accumulation due to task entanglement. This challenge has driven the field towards multi-stage pipelines that rely on pre-trained discrete speech tokenizers, but these create a semantic-acoustic divide, limiting holistic and expressive speech generation. We resolve this dilemma through hierarchical semantic-acoustic modeling with semi-discrete residual representations. Our framework introduces a differentiable quantization bottleneck that induces natural specialization: a Text-Semantic Language Model (TSLM) generates semantic-prosodic plans, while a Residual Acoustic Model (RALM) recovers fine-grained acoustic details. This hierarchical semantic-acoustic representation guides a local diffusion-based decoder to generate high-fidelity speech latents. Critically, the entire architecture is trained end-to-end under a simple diffusion objective, eliminating dependency on external discrete speech tokenizers. Trained on over 1 million hours of speech, our 0.5B-parameter model achieves state-of-the-art zero-shot TTS performance among open-source systems, demonstrating that our approach delivers expressive and stable synthesis. Audio samples are available at: https://voxcpm.github.io/VoxCPM-demopage/.
CONCUR: A Framework for Continual Constrained and Unconstrained Routing
Peter Baile Chen ⋅ Weiyue Li ⋅ Dan Roth ⋅ Mike Cafarella ⋅ Samuel Madden ⋅ Jacob Andreas
AI tasks differ in complexity and are best addressed with different computation strategies (e.g., combinations of models and decoding methods). Hence, an effective routing system that maps tasks to the appropriate strategies is crucial. Most prior methods build the routing framework by training a single model across all strategies, which demands full retraining whenever new strategies appear and leads to high overhead. Attempts at such continual routing, however, often face difficulties with generalization. Prior models also typically use a single input representation, limiting their ability to capture the full complexity of the routing problem and leading to sub-optimal routing decisions. To address these gaps, we propose CONCUR, a continual routing framework that supports both constrained and unconstrained routing (i.e., routing with or without a budget). Our modular design trains a separate predictor model for each strategy, enabling seamless incorporation of new strategies with low additional training cost. Our predictors also leverage multiple representations of both tasks and computation strategies to better capture overall problem complexity. Experiments on both in-distribution and out-of-distribution, knowledge- and reasoning-intensive tasks show that our method outperforms the best single strategy and strong existing routing techniques with higher end-to-end accuracy and lower inference cost in both continual and non-continual settings, while also reducing training cost in the continual setting.
MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning
Jiachun Li ⋅ Shaoping Huang ⋅ Zhuoran Jin ⋅ Chenlong Zhang ⋅ Pengfei Cao ⋅ Yubo Chen ⋅ Kang Liu ⋅ Jun Zhao
Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.
A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Kaiyang Wan ⋅ Lang Gao ⋅ Honglin Mu ⋅ Preslav Nakov ⋅ Yuxia Wang ⋅ Xiuying Chen
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods. InfoQA is available at https://anonymous.4open.science/r/InfoQA-55D1.
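For context, the classical Fano inequality that "Fano-style" bounds build on is stated below. This is the textbook form, not the paper's bound, which the abstract does not reproduce; the symbols $X$ (the correct answer, taking values in a finite set $\mathcal{X}$), $Y$ (what the model can extract in one pass), and $\hat{X}$ (the model's answer, computed from $Y$) are labels chosen here for illustration.

```latex
% Classical Fano inequality, with P_e = \Pr[\hat{X} \neq X] and logarithms in base 2:
H(X \mid Y) \;\le\; H_b(P_e) + P_e \log_2\bigl(|\mathcal{X}| - 1\bigr),
\qquad H_b(p) = -p \log_2 p - (1 - p)\log_2(1 - p).

% Since H_b(P_e) \le 1 and \log_2(|\mathcal{X}| - 1) \le \log_2 |\mathcal{X}|,
% this yields an upper bound on accuracy:
\mathrm{Acc} \;=\; 1 - P_e \;\le\; 1 - \frac{H(X \mid Y) - 1}{\log_2 |\mathcal{X}|}.
```

Read this way, accuracy is capped whenever the residual uncertainty $H(X \mid Y)$ stays large, which matches the abstract's point that performance degrades once the information a task demands exceeds what a single pass can carry.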
Confident and Adaptive Generative Speech Recognition via Risk Control
Amit Damri ⋅ Bracha Laufer-Goldshtein
Automatic Speech Recognition (ASR) systems frequently produce transcription errors due to acoustic variability, which require post-processing correction methods. Recent approaches leverage Large Language Models (LLMs) for generative ASR error correction using N-best hypotheses but rely on fixed set sizes regardless of input complexity and do not provide performance guarantees. We propose an adaptive framework that dynamically determines the optimal number of hypotheses for each input using risk control. This mechanism leverages ASR confidence scores and applies Learn then test (LTT) to control the expected relative word error rate degradation compared to the best achievable performance for a given model and hypothesis set. Experimental results demonstrate that our approach provides theoretical guarantees with high-probability bounds while matching or exceeding fixed-size correction baselines and requiring fewer hypotheses on average, achieving substantial computational savings under diverse acoustic conditions.
Learning From Dictionary: Enhancing Robustness of Machine-Generated Text Detection in Zero-Shot Language via Adversarial Training
Yuanfan Li ⋅ Qi Zhou ⋅ Zexuan Xie
Machine-generated text (MGT) detection is critical for safeguarding online content integrity and preventing the spread of misleading information. Although existing detectors achieve high accuracy in monolingual settings, they exhibit severe performance degradation on zero-shot languages and are vulnerable to adversarial attacks. To tackle these challenges, we propose a robust adversarial training framework named Translation-based Attacker Strengthens MulTilingual DefEnder (TASTE). TASTE comprises two core components: an attacker that performs code-switching by querying translation dictionaries to generate adversarial examples, and a detector trained to resist these attacks while generalizing to unseen languages. We further introduce a novel Language-Agnostic Adversarial Loss (LAAL), which encourages the detector to learn language-invariant feature representations and thus enhances zero-shot detection performance and robustness against unseen attacks. Additionally, the attacker and detector are synchronously updated, enabling continuous improvement of defensive capabilities. Experimental results on 9 languages and 8 attack types show that TASTE surpasses 8 SOTA detectors, improving the average F1 score by 0.064 and reducing the average Attack Success Rate (ASR) by 3.8%. Our framework offers a promising approach for building robust, multilingual MGT detectors with strong generalization to real-world adversarial scenarios. Our code is available at https://github.com/Liyuuuu111/MGT-Eval, and our datasets and pretrained checkpoint are available at https://drive.google.com/file/d/1w1hbdiZMS_JzPntVMWM3qrTQ4KxJf-t6.
LiveWeb-IE: A Benchmark For Online Web Information Extraction
Seungbin Yang ⋅ Jihwan Kim ⋅ Jaemin Choi ⋅ Dongjin Kim ⋅ Soyoung Yang ⋅ ChaeHun Park ⋅ Jaegul Choo
Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems. In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information. Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS. We believe that this study lays the foundation for developing practical and robust WIE systems.
Differential Fine-Tuning Large Language Models Towards Better Diverse Reasoning Abilities
Xiaosong Yuan ⋅ Chen Shen ⋅ Shaotian Yan ⋅ kaiyuan liu ⋅ Xiaofeng Zhang ⋅ Sinan Fan ⋅ Liang Xie ⋅ Wenxiao Wang ⋅ Renchu Guan ⋅ Ying Wang ⋅ Jieping Ye
Unlike general question answering, reasoning in large language models (LLMs) requires explicit derivations, and supervised fine-tuning (SFT) can strengthen multiple reasoning abilities by learning from various datasets. However, neither joint (mix-up) nor continual training on these datasets reliably maintains the performance of single-dataset SFT: results are sometimes better and sometimes worse, which shows that vanilla SFT can introduce conflicts as well as facilitate reasoning abilities. In this paper, we propose a novel framework to mitigate the conflicts and preserve benefits among different reasoning tasks, and even surpass each task's single-dataset SFT performance. We start by exploring the differences between reasoning fine-tuned and base LLMs by analyzing their parameter variations during model inference, and we discover that each reasoning capability has exclusive parameters that benefit it more evidently than others. In contrast, the overlapped parameters of tasks can bring benefits or conflicts. Inspired by these findings, we propose to update the exclusive and overlapped parameters differentially according to specific reasoning task combinations, thereby avoiding unnecessary conflicts while maintaining benefits. Consistent improvements in mix-up and continual SFT experiments demonstrate that the proposed SFT strategy can achieve better performance on various LLMs (Llama3-8B, Mistral-7B, and Qwen2.5-14B) and diverse reasoning tasks with fewer conflicts, showing the superiority and generality of our analysis findings and the proposed approach.
AssoMem: Scalable Memory QA with Multi-Signal Associative Retrieval
Kai Zhang ⋅ Xinyuan Zhang ⋅ Ejaz Ahmed ⋅ Hongda Jiang ⋅ Caleb Kumar ⋅ Kai Sun ⋅ Zhaojiang Lin ⋅ Sanat Sharma ⋅ Shereen Oraby ⋅ AARON COLAK ⋅ Ahmed Aly ⋅ Anuj Kumar ⋅ Xiaozhong Liu ⋅ Xin Dong
Accurate recall from large-scale memories remains a core challenge for memory-augmented AI assistants performing question answering (QA), especially in similarity-dense scenarios where existing methods mainly rely on semantic distance to the query for retrieval. Inspired by how humans link information associatively, we propose AssoMem, a novel framework constructing an associative memory graph that anchors dialogue utterances to automatically extracted clues. This structure provides a rich organizational view of the conversational context and facilitates importance-aware ranking. Further, AssoMem integrates multi-dimensional retrieval signals—relevance, importance, and temporal alignment—using an adaptive mutual information (MI)-driven fusion strategy. Extensive experiments across three benchmarks and a newly introduced dataset, MeetingQA, demonstrate that AssoMem consistently outperforms state-of-the-art baselines, verifying its superiority in context-aware memory recall.
ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Zhilin Wang ⋅ Jaehun Jung ⋅ Ximing Lu ⋅ Shizhe Diao ⋅ Ellie Evans ⋅ Jiaqi Zeng ⋅ Pavlo Molchanov ⋅ Yejin Choi ⋅ Jan Kautz ⋅ Yi Dong
Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks.
Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models
Anindya Sundar Das ⋅ Kangjie Chen ⋅ Monowar Bhuyan
Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage but, when activated, can cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on the consistent shift in attention and gradient attribution when processing poisoned inputs, where the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
Non-Collaborative User Simulators for Tool Agents
Jeonghoon Shim ⋅ Woojung Song ⋅ Cheyon Jin ⋅ Seungwon KooK ⋅ Yohan Jo
Tool agents interact with users through multi-turn dialogues to accomplish various tasks. Recent studies have adopted user simulation methods to develop these agents in multi-turn settings. However, existing user simulators tend to be agent-friendly, exhibiting only cooperative behaviors, failing to train and test agents against non-collaborative users in the real world. We propose a novel user simulator architecture that simulates four categories of non-collaborative behaviors: requesting unavailable services, digressing into tangential conversations, expressing impatience, and providing incomplete utterances. Our user simulator can simulate challenging and natural non-collaborative behaviors while reliably delivering all intents and information necessary to accomplish the task. Our experiments on MultiWOZ and $\tau$-bench reveal significant performance degradation in state-of-the-art tool agents when encountering non-collaborative users, as well as agent weaknesses under each non-collaborative condition such as escalated hallucinations and dialogue breakdowns. Our findings point to the need for methods that can improve agent robustness to the wide range of user behaviors encountered in deployment. We release the extensible simulation framework to help the community develop and stress-test tool agents under realistic conditions within their own service domains.
TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization
Waris Quamer ⋅ Mu-Ruei Tseng ⋅ Ghady Nasrallah ⋅ Ricardo Gutierrez-Osuna
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.
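The spherical interpolation mentioned above can be illustrated with a generic slerp between two embedding vectors. This is a minimal sketch of the standard formula only; the function name `slerp` and the 2-D toy vectors are assumptions for this example, and the abstract does not say how TVTSyn applies interpolation inside its Global Timbre Memory.

```python
import math


def slerp(a, b, t):
    """Spherically interpolate between vectors a and b at fraction t in [0, 1].

    Both inputs are unit-normalized first, so the result stays on the unit sphere.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a, b = unit(a), unit(b)
    # Clamp the dot product to guard against rounding just outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-8:
        # Parallel or antipodal vectors: the arc is degenerate, fall back to a.
        return a
    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(midpoint)
```

Unlike linear interpolation, the interpolated vector keeps unit norm, which is one plausible reading of "preserves identity geometry" for normalized speaker embeddings.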
PERSONA: Dynamic and Compositional Inference-Time Personality Control via Activation Vector Algebra
Xiachong Feng ⋅ Liang Zhao ⋅ Weihong Zhong ⋅ yichong huang ⋅ Yuxuan Gu ⋅ Lingpeng Kong ⋅ Xiaocheng Feng ⋅ Bing Qin
Current methods for personality control in Large Language Models rely on static prompting or expensive fine-tuning, failing to capture the dynamic and compositional nature of human traits. We introduce PERSONA, a training-free framework that achieves fine-tuning level performance through direct manipulation of personality vectors in activation space. Our key insight is that personality traits appear as extractable, approximately orthogonal directions in the model's representation space that support algebraic operations. The framework operates through three stages: Persona-Base extracts orthogonal trait vectors via contrastive activation analysis; Persona-Algebra enables precise control through vector arithmetic (scalar multiplication for intensity, addition for composition, subtraction for suppression); and Persona-Flow achieves context-aware adaptation by dynamically composing these vectors during inference. On PersonalityBench, our approach achieves a mean score of 9.60, nearly matching the supervised fine-tuning upper bound of 9.61 without any gradient updates. On our proposed Persona-Evolve benchmark for dynamic personality adaptation, we achieve up to 91% win rates across diverse model families. These results provide evidence that aspects of LLM personality are mathematically tractable, opening new directions for interpretable and efficient behavioral control.
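The vector arithmetic listed above (scalar multiplication for intensity, addition for composition, subtraction for suppression) can be sketched on toy vectors. The helper `compose`, the three-dimensional trait vectors, and the weights are hypothetical; in the framework itself the vectors would come from contrastive activation analysis of a model, which this example does not attempt.

```python
def compose(hidden, weighted_traits):
    """Add a weighted sum of trait vectors to a hidden-state vector.

    `weighted_traits` is a list of (weight, vector) pairs. A positive weight
    amplifies a trait, a negative weight suppresses it, and the magnitude
    controls intensity.
    """
    out = list(hidden)
    for weight, vec in weighted_traits:
        out = [h + weight * x for h, x in zip(out, vec)]
    return out


# Hypothetical orthogonal trait directions in a 3-dimensional toy space.
extraversion = [1.0, 0.0, 0.0]
agreeableness = [0.0, 1.0, 0.0]

hidden = [0.5, 0.5, 0.5]
# Strengthen extraversion (x2.0) and suppress agreeableness (x-0.5).
steered = compose(hidden, [(2.0, extraversion), (-0.5, agreeableness)])
print(steered)
```

Because the toy directions are orthogonal, each weight changes only its own coordinate, which mirrors why approximately orthogonal trait vectors make composition predictable.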
Retrieval-Augmented Generation (RAG) systems rely on retrieving relevant evidence from a corpus to support downstream generation. The common practice of splitting a long document into multiple shorter passages enables finer-grained and targeted information retrieval. However, it also introduces challenges when a correct retrieval would require inference across passages, such as resolving coreference, disambiguating entities, and aggregating evidence scattered across multiple sources. Many state-of-the-art (SOTA) reranking methods, despite utilizing powerful large pretrained language models with potentially high inference costs, still neglect the aforementioned challenges. Therefore, we propose Embedding-Based Context-Aware Reranker (EBCAR), a lightweight reranking framework operating directly on embeddings of retrieved passages with enhanced cross-passage understandings through the structural information of the passages and a hybrid attention mechanism, which captures both high-level interactions across documents and low-level relationships within each document. We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency. Our source code is available at https://github.com/BorealisAI/EBCAR.
Smarter Not Harder: Generative Process Evaluation with Intrinsic-Signal Driving and Ability‑Adaptive Reward Shaping
Tao He ⋅ Rongchuan Mu ⋅ Lizi Liao ⋅ Yixin Cao ⋅ Yang Li ⋅ Yijia Luo ⋅ Weixun Wang ⋅ Ming Liu ⋅ Bing Qin
Large reasoning models (LRMs) have shown strong performance in complex mathematical reasoning when optimized via reinforcement learning (RL). However, conventional outcome-only reward provides sparse feedback, leading to inefficient optimization. In this work, we investigate whether generative process reward models (GenPRMs) can accelerate RL training of LRMs by improving the utilization of reasoning trajectories. We first analyze critical limitations in existing GenPRMs, including their heavy reliance on reasoning ability during correctness judgment, and suppression of exploration as well as vulnerability to reward hacking during reward assignment. To address these limitations, we first propose a novel **intrinsic-signal-driven evaluation** mechanism, which judges reasoning steps using semantic cues from the solution, thus mitigating extensive dependence on GenPRM. Furthermore, we (i) adopt **thought-level rewarding granularity** to alleviate over-dense step rewards, and (ii) design a **difficulty-aware reward formulation** that dynamically balances exploration and exploitation while keeping the optimization target of key tokens to mitigate reward hacking. We integrate these innovations into the process reward-based GRPO, resulting in the proposed **TP-GRPO** algorithm. Experiments on LRMs with 1.5B and 7B parameters show that TP-GRPO achieves higher improvements while using significantly fewer training samples, and further analyses confirm the effectiveness of our proposed process evaluation mechanism.
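For readers unfamiliar with the GRPO baseline that TP-GRPO builds on, the sketch below shows the group-relative advantage commonly used in GRPO: each sampled response's reward is standardized against the other responses to the same prompt. It illustrates plain outcome-reward GRPO only; the thought-level and difficulty-aware rewards proposed above are not reproduced, and the function name, the `eps` constant, and the use of the population standard deviation are choices made for this example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    # Population standard deviation; eps avoids division by zero when all
    # rewards in the group are identical.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, scored 1 (correct) or 0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one way to see why sparse outcome-only rewards can make optimization inefficient.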
CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs
Li Li ⋅ Ziyi Wang ⋅ Yongliang Wu ⋅ Jianfei Cai ⋅ xu yang
Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher–student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.
Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models
Yinjie Wang ⋅ Ling Yang ⋅ Bowen Li ⋅ Ye Tian ⋅ Ke Shen ⋅ Mengdi Wang
The extension of diffusion models to language tasks has shown promising results, but their post-training methods remain largely unexplored. We highlight the importance of aligning a diffusion language model’s preference-inference trajectory with its post-training objective. To this end, we propose TraceRL, a trajectory-aware reinforcement learning framework for DLMs that incorporates information from inference trajectories into post-training and is applicable to both full-attention and block-attention diffusion models. We also introduce a diffusion-based value model that enhances training stability and naturally accommodates process rewards. We demonstrate TraceRL’s superiority in enhancing a model’s reasoning ability on complex math and coding tasks, as well as its applicability in scaling block diffusion models to larger block sizes. Employing TraceRL, we derive a series of state-of-the-art diffusion language models, namely TraDo. Although smaller than Qwen2.5-7B-Instruct, TraDo-4B-Instruct consistently outperforms it on complex math reasoning tasks. TraDo-8B-Instruct achieves 4.5% higher accuracy on MATH500 than Qwen2.5-7B-Instruct and 6.6% higher accuracy on LiveCodeBench-V2 than Llama3.1-8B-Instruct. Through curriculum learning, we also develop the first 8B-scale long-CoT diffusion language model. We open-source our code at https://github.com/Gen-Verse/dLLM-RL.
Beyond Markovian Drifts: Action-Biased Geometric Walks with Memory for Personalized Summarization
Parthiv Chatterjee ⋅ Asish Batha ⋅ Tashvi patel ⋅ Sourish Dasgupta ⋅ Tanmoy Chakraborty
Document summarization helps readers focus on the "content-of-interest", a *subjective* and *time-variant* quantity. Capturing this *dynamic subjectivity* requires modeling how user preferences evolve over time, thereby demanding *personalized summarization*. Recent news recommendation and summarization models often assume that preferences follow a *memoryless or short-memory random walk* on interaction graphs, i.e., a Markovian diffusion seeded at the latest interaction or compressed into a short hidden state or prompt. We ask whether such a hypothesis also holds for personalized summarization. To test this, we propose **Walk2Pers**, a lightweight encoder–decoder framework that extends the walk view with *action-conditioned geometric steps*, decomposed into (i) a *magnitude* controlling shift strength and (ii) an *orientation* capturing continuity vs. novelty. The process is mediated by dual memory lanes that reinforce consistent interests while suppressing disinterest, and is augmented with a drift term for summary requests. We show theoretically that such structured walks approximate first-order action-conditioned kernels, and empirically validate the hypothesis on PENS, OpenAI-Reddit, and PersonalSum. Using PerSEval, a personalization metric with strong human correlation, Walk2Pers outperforms specialized personalized summarizers by an average of $0.41 \uparrow$, and strong LLM baselines (DeepSeek-R1-14B, LLaMA-2-13B, Mistral-7B, Zephyr-7B) by $0.22 \uparrow$. Analyses further confirm cross-domain robustness ($0.19 \uparrow$ over the best LLM) and stability on long histories. Together, these results support viewing personalized summarization as an *action-biased geometric walk with memory*, offering both interpretability and efficiency.
Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration
Yijia Shao ⋅ Vinay Samuel ⋅ Yucheng Jiang ⋅ John Yang ⋅ Diyi Yang
While the advancement of large language models has spurred the development of AI agents to automate tasks, numerous use cases inherently require agents to collaborate with humans due to humans' latent preferences, domain expertise, or the need for control. To facilitate the study of human-agent collaboration, we introduce Collaborative Gym (Co-Gym), an open framework for developing and evaluating collaborative agents that engage in bidirectional communication with humans while interacting with task environments. We describe how the framework enables the implementation of new task environments and coordination between humans and agents through a flexible, non-turn-taking interaction paradigm, along with an evaluation suite that assesses both collaboration outcomes and processes. Our framework provides both a simulated condition with a reliable user simulator and a real-world condition with an interactive web application. Initial benchmark experiments across three representative tasks (creating travel plans, writing related work sections, and analyzing tabular data) demonstrate the benefits of human-agent collaboration: the best-performing collaborative agents consistently outperform their fully autonomous counterparts in task performance, achieving win rates of 86% in Travel Planning, 74% in Tabular Analysis, and 66% in Related Work when evaluated by real users. Despite these improvements, our evaluation reveals persistent limitations in current language models and agents, with communication and situational awareness failures observed in 65% and 40% of cases in the real condition, respectively. Released under the permissive MIT license, Co-Gym supports the addition of new task environments and can be used to develop collaborative agent applications, while its evaluation suite enables assessment and improvement of collaborative agents.
Dyna-Mind: Learning to Simulate from Experience for Better AI Agents
Xiao Yu ⋅ Baolin Peng ⋅ Michel Galley ⋅ Hao Cheng ⋅ Qianhui Wu ⋅ Janardhan Kulkarni ⋅ Suman Nath ⋅ Zhou Yu ⋅ Jianfeng Gao
Reasoning models have recently shown remarkable progress in domains such as math and coding. However, their expert-level abilities in math and coding contrast sharply with their performance in long-horizon, interactive tasks such as web navigation and computer/phone-use. Inspired by literature on human cognition, we argue that current AI agents need "vicarious trial and error" (the capacity to mentally simulate alternative futures before acting) in order to enhance their understanding and performance in complex interactive environments. We introduce Dyna-Mind, a two-stage training framework that explicitly teaches (V)LM agents to integrate such simulation into their reasoning. In stage 1, we introduce Reasoning with Simulations (ReSim), which trains the agent to generate structured reasoning traces from expanded search trees built from real experience gathered through environment interactions. ReSim thus grounds the agent's reasoning in faithful world dynamics and equips it with the ability to anticipate future states in its reasoning. In stage 2, we propose Dyna-GRPO, an online reinforcement learning method to further strengthen the agent's simulation and decision-making ability by using both outcome rewards and intermediate states as feedback from real rollouts. Experiments on two synthetic benchmarks (Sokoban and ALFWorld) and one realistic benchmark (AndroidWorld) demonstrate that (1) ReSim effectively infuses simulation ability into AI agents, and (2) Dyna-GRPO leverages outcome and interaction-level signals to learn better policies for long-horizon, planning-intensive tasks. Together, these results highlight the central role of simulation in enabling AI agents to reason, plan, and act more effectively in ever more challenging environments.
Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization
Taeyoon Kwon ⋅ Dongwook Choi ⋅ Hyojun Kim ⋅ Sunghwan Kim ⋅ Seungjun Moon ⋅ Beong-woo Kwak ⋅ Kuan-Hao Huang ⋅ Jinyoung Yeo
LLM-powered embodied agents have shown success on conventional object-rearrangement tasks, but providing personalized assistance that leverages user-specific knowledge from past interactions presents new challenges. We investigate these challenges through the lens of agents' memory utilization along two critical dimensions: object semantics (identifying objects based on personal meaning) and user patterns (recalling sequences from behavioral routines). To assess these capabilities, we construct Memento, an end-to-end two-stage evaluation framework comprising single-memory and joint-memory tasks. Our experiments reveal that current agents can recall simple object semantics but struggle to apply sequential user patterns to planning. Through in-depth analysis, we identify two critical bottlenecks: information overload and coordination failures when handling multiple memories. Based on these findings, we explore memory architectural approaches to address these challenges. Given our observation that episodic memory provides both personalized knowledge and in-context learning benefits, we design a hierarchical knowledge graph-based user-profile memory module that separately manages personalized knowledge, achieving substantial improvements on both single and joint-memory tasks.
LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation
Jude Khouja ⋅ Lingyi Yang ⋅ Karolina Korgul ⋅ Simi Hellsten ⋅ Vlad A. Neacșu ⋅ Harry Mayne ⋅ Ryan Kearns ⋅ Andrew Bean ⋅ Adam Mahdi
Frontier language models demonstrate increasing ability at solving reasoning problems, but their performance is often inflated by circumventing reasoning and instead relying on their expanding knowledge and memorisation capacity. We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems. These obfuscations preserve the underlying solution logic while reducing the likelihood that problems are solvable via knowledge or memorisation. Our experiments show that models exploit shortcuts on the original questions, as performance drops markedly upon obfuscation. Even the best reasoning models remain highly sensitive, with scores dropping from around 0.59 on original problems to 0.48 after obfuscation. LINGOLY-TOO disentangles reasoning from knowledge, offering a clearer measure of true reasoning capabilities.
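To make the idea of an obfuscation that preserves solution logic concrete, the toy sketch below applies one consistent letter-for-letter substitution to a text. Because the mapping is a bijection, every pattern a solver could reason over (repeated morphemes, shared affixes) survives, while the surface forms no longer match anything memorised. The functions `make_cipher` and `obfuscate` and the sample words are assumptions for this example; the benchmark's expert-designed templates are linguistically informed and considerably more careful than a random permutation.

```python
import random
import string


def make_cipher(seed, alphabet=string.ascii_lowercase):
    """Build a random one-to-one mapping from the alphabet onto itself."""
    rng = random.Random(seed)
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    return dict(zip(alphabet, shuffled))


def obfuscate(text, cipher):
    """Apply the mapping to each character; unmapped characters pass through."""
    return "".join(cipher.get(ch, ch) for ch in text)


cipher = make_cipher(seed=42)
# Two made-up word forms that share a stem, standing in for puzzle data.
original = ["kala", "kalami"]
obfuscated = [obfuscate(word, cipher) for word in original]
print(obfuscated)
```

The shared stem of the two forms is still shared after obfuscation, so a solver that reasons from the data alone loses nothing.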
Large Language Models (LLMs) can compose poetry, but how far are they from human poets? In this paper, we introduce POEMetric, the first comprehensive framework for poetry evaluation, examining 1) basic instruction-following abilities in generating poems according to a certain form and theme, 2) advanced abilities of showing creativity, lexical diversity, and idiosyncrasy, evoking emotional resonance, and using imagery and literary devices, and 3) general appraisal of the overall poem quality and estimation of authorship. We curated a human poem dataset (203 English poems of 7 fixed forms annotated with meter, rhyme patterns and themes) and experimented with 30 LLMs for poetry generation based on the same forms and themes of the human data, totaling 6,090 LLM poems. Based on POEMetric, we assessed the performance of both human poets and LLMs through rule-based evaluation and LLM-as-a-judge, whose results were validated by human experts. Results show that, though the top model achieved high form accuracy (4.26 out of 5.00, with Gemini-2.5-Pro as a judge; same below) and theme alignment (4.99), all models failed to reach the same level of advanced abilities as human poets, who achieved unparalleled creativity (4.02), idiosyncrasy (3.95), emotional resonance (4.06), and skillful use of imagery (4.49) and literary devices (4.67). Humans also defeated the best-performing LLM in overall poem quality (4.22 vs. 3.20). As such, poetry generation remains a formidable challenge for LLMs.
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park ⋅ Jueun Kim ⋅ Wook-Shin Han
Real-world Table–Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated—and therefore error-prone—and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table–Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question–answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. We will release the benchmark, construction code, and baseline results to spur progress toward robust, realistic Table–Text QA models.
GRO-RAG: Gradient-aware Re-rank Optimization for Multi-source Retrieval-Augmented Generation
Siyuan Chen ⋅ Hang Ding ⋅ Xiaoyu Kang ⋅ Jiechao Gao
Retrieval-Augmented Generation (RAG) systems often rely on information retrieved from heterogeneous sources to support generation tasks. However, existing approaches typically either aggregate all sources uniformly or statically select a single source, neglecting semantic complementarity. Moreover, they commonly employ re-ranking models to obtain Top-k documents, without accounting for actual contribution to generation objective. In this paper, we propose GRO-RAG, a training-free, gradient-aware re-ranking framework for multi-source RAG. Our method performs Top-k document selection by reading gradients from the language model, estimating each document’s contribution to the generation loss through a single backward pass. This enables re-ranking not by heuristic relevance, but by direct feedback from LLM's generation objective. At the source level, we incorporate inter-source redundancy and query relevance to select source combination prior to re-ranking. Theoretically, we prove that this gradient-based Top-k selection approximates the optimal subset minimizing the generation loss, and aligns with minimizing the leave-one-out loss upper bound. Experiments across multi-source QA and open-domain generation tasks demonstrate consistent improvements in generation quality, highlighting the importance of generation-aware retrieval selection in multi-source RAG.
AutoLibra: Agent Metric Induction from Open-Ended Human Feedback
Hao Zhu ⋅ Phil Cuvin ⋅ Xinkai Yu ⋅ Charlotte Yan ⋅ Jason Zhang ⋅ Diyi Yang
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose AutoLibra, a framework for agent evaluation that transforms open-ended human feedback, e.g., “If you find that the button is disabled, don’t click it again” or “This agent has too much autonomy to decide what to do on its own”, into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent’s behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge as evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: “coverage” and “redundancy”. Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra’s ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and discover new metrics to analyze agents. We also present two applications of AutoLibra in agent improvement. First, we show that AutoLibra helps human prompt engineers diagnose agent failures and improve prompts iteratively. Second, we find that AutoLibra can induce metrics for automatic optimization of agents, which lets agents improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
Kazuki Fujii ⋅ Yukito Tajima ⋅ Sakae Mizuki ⋅ Masaki Kawamura ⋅ Hinari Shimada ⋅ Taihei Shiotani ⋅ Koshiro Saito ⋅ Masanari Oi ⋅ Taishi Nakamura ⋅ Okamoto ⋅ Shigeki Ishida ⋅ Kakeru Hattori ⋅ Youmi Ma ⋅ Hiroya Takamura ⋅ Rio Yokota ⋅ Jun Sakuma ⋅ Naoaki Okazaki
The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed pre-training datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode ($\approx$16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach refines low-quality code, maximizing data utility. SwallowMath ($\approx$2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by +17.0 on HumanEval and +16.1 on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting yielding the largest gains. By releasing datasets, prompts, checkpoints, and pipeline code, we ensure reproducibility and provide a transferable transform-and-retain methodology that can be adapted to other base models and LLM rewriting setups.
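As a rough illustration of the first pipeline stage named above (syntax validation), the sketch below keeps only snippets that parse as Python. It is a minimal sketch under stated assumptions, not the authors' released pipeline code: the helper name `passes_syntax_check` and the sample snippets are hypothetical, and the pylint-based filtering and LLM rewriting stages are not shown.

```python
import ast


def passes_syntax_check(source: str) -> bool:
    """Return True if `source` parses as Python (hypothetical helper)."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError: invalid code; ValueError: e.g. null bytes on some versions.
        return False
    return True


# Hypothetical sample snippets: the first is valid, the second is not.
snippets = [
    "def add(a, b):\n    return a + b\n",
    "def broken(:\n    pass\n",
]
kept = [s for s in snippets if passes_syntax_check(s)]
```

Parsing with `ast.parse` checks syntax only; it does not execute the snippet, so it is safe to run on untrusted code but says nothing about runtime correctness.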
FeatureBench: Benchmarking Agentic Coding for Complex Feature Development
Qixing Zhou ⋅ JiaCheng Zhang ⋅ Haiyang Wang ⋅ Rui Hao ⋅ Jiahe Wang ⋅ Minghao Han ⋅ Yuxue Yang ⋅ Shuzhe Wu ⋅ Feiyang Pan ⋅ Lue Fan ⋅ Dandan Tu ⋅ Zhaoxiang Zhang
Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating the evaluation coverage. To address such issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach can identify feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring the proper functioning of other features after the separation. Using this framework, we curated 200 challenging evaluation tasks and 3825 executable environments from 24 open-source repositories in the first version of our benchmark. Empirical evaluation reveals that a state-of-the-art agentic model such as Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeeds on only 11.0% of tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of constructed environments also makes our method potentially valuable for agent training.
A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
Qianben Chen ⋅ Jingyi Cao ⋅ Jiayu Zhang ⋅ Tianrui Qin ⋅ LiXiaowan ⋅ Zhu ⋅ Dingfeng Shi ⋅ He Zhu ⋅ Minghao Liu ⋅ Xiaobo Liang ⋅ Ge Zhang ⋅ Jian Yang ⋅ Yuchen Jiang ⋅ Wangchunshu Zhou
Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third instant mode that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4\% on BrowseComp, 70.4\% on AIME25, and 16.7\% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only \$0.00487 per correct answer—cutting cost by 45.2\% relative to reasoning and 33.5\% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
Yuhan Song ⋅ Linhao Zhang ⋅ Chuhan Wu ⋅ Aiwei Liu ⋅ Wei Jia ⋅ Houfeng Wang ⋅ Zhou Xiao
Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks. Our code and model are publicly available at https://github.com/Tencent/StableToken.
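The bit-wise voting idea can be sketched as a per-bit majority vote over the token indices proposed by each branch. This is only an illustrative sketch: it assumes every branch emits an integer token whose binary digits are the bits being voted on and that the number of branches is odd, and the function name `bitwise_majority_vote` is hypothetical. The actual StableToken mechanism may differ in its details.

```python
def bitwise_majority_vote(branch_tokens: list[int], num_bits: int) -> int:
    """Merge one token index per branch by majority vote on each bit."""
    merged = 0
    for bit in range(num_bits):
        # Count how many branches set this bit.
        ones = sum((tok >> bit) & 1 for tok in branch_tokens)
        if 2 * ones > len(branch_tokens):
            merged |= 1 << bit
    return merged


# Three hypothetical branches; the second disagrees on a single bit.
merged_token = bitwise_majority_vote([0b1011, 0b1001, 0b1011], num_bits=4)
```

In this example the disagreeing bit is outvoted, so a perturbation that flips one bit in one branch leaves the merged token unchanged.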
KnowProxy: Adapting Large Language Models by Knowledge-guided Proxy
Gukhyeon Lee ⋅ Yeachan Kim ⋅ SANGKEUN LEE
Adapting large language models (LLMs) using smaller proxy models has been shown to improve training efficiency, where the LLMs remain frozen while the proxies are tuned on top. However, this approach typically requires access to the output probability distributions of LLMs, which are often inaccessible or unstable. To address this limitation, we propose KnowProxy, a knowledge-guided proxy framework in which the proxy is trained with textual knowledge rather than probability distributions. Specifically, we first elicit textual knowledge and reasoning from frozen LLMs through prompting, and then the proxy model learns to adapt this reasoning to target task distributions. We evaluate KnowProxy on diverse reasoning benchmarks with different fine-tuning scenarios. Comprehensive results show that KnowProxy achieves competitive or even better performance without direct access to probability distributions, thereby providing a scalable and versatile alternative to traditional fine-tuning.
Heuristic-Based Ideation for Guiding LLMs Toward Structured Creativity
Xiao Liu ⋅ Haokun Liu ⋅ Chenhao Tan
Large Language Models (LLMs) hold immense promise for accelerating scientific discovery, yet current LLM-based ideation methods often rely on ad-hoc strategies rather than systematic frameworks. This blog introduces Ideation Heuristics, a systematic approach that formalizes 20 cognitive heuristics that structure how researchers generate new ideas. We show that researchers across disciplines find these heuristics highly useful, and we demonstrate how they can be operationalized through Claude skills.
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs
Jialin Yang ⋅ Dongfu Jiang ⋅ Tony He ⋅ Sherman Siu ⋅ Yuxuan Zhang ⋅ Disen Liao ⋅ Zhuofeng Li ⋅ Huaye Zeng ⋅ Yiming Jia ⋅ Haozhe Wang ⋅ Benjamin Schneider ⋅ Chi Ruan ⋅ Wentao Ma ⋅ Zhiheng Lyu ⋅ Yifei Wang ⋅ Yi Lu ⋅ Quy Duc Do ⋅ Ziyan Jiang ⋅ Ping Nie ⋅ Wenhu Chen
As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce $\textbf{StructEval}$, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: $\textbf{1)}$ generation tasks, producing structured output from natural language prompts, and $\textbf{2)}$ conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 types of task, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps—even state-of-the-art models like o1-mini achieve only $75.58$ average score, with open-source alternatives lagging approximately $10$ points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
Learning From the Past with Cascading Eligibility Traces
Tokiniaina Raharison Ralambomihanta ⋅ Ivan Anokhin ⋅ Roman Pogodin ⋅ Samira Ebrahimi Kahou ⋅ Jonathan Cornford ⋅ Blake A Richards
Animals often receive information about errors and rewards after significant delays. In some cases these delays are fixed aspects of neural processing or sensory feedback; for example, there is typically a delay of tens to hundreds of milliseconds between motor actions and visual feedback. The standard approach to handling delays in models of synaptic plasticity is to use eligibility traces. However, standard eligibility traces that decay exponentially mix together any events that happen during the delay, presenting a problem for any credit assignment signal that occurs with a significant delay. Here, we show that eligibility traces formed by a state-space model, inspired by a cascade of biochemical reactions, can provide a temporally precise memory for handling credit assignment at arbitrary delays. We demonstrate that these cascading eligibility traces (CETs) work for credit assignment at behavioral time-scales, ranging from seconds to minutes. CETs can also handle extremely slow retrograde signals, such as those found in retrograde axonal signaling. These results demonstrate that CETs can provide an excellent basis for modeling synaptic plasticity.
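To see why a cascade can give a more temporally precise memory than a single exponential trace, the sketch below simulates a generic chain of leaky states driven by a unit impulse. It is an assumption-laden illustration, not the authors' model: the linear chain, the shared `rate`, the Euler step `dt`, and the function name `cascade_impulse_response` are all hypothetical choices made for this example.

```python
def cascade_impulse_response(
    n_stages: int, rate: float, dt: float, n_steps: int
) -> list[list[float]]:
    """Euler-simulate a chain of leaky states after a unit impulse into stage 0.

    Stage 0 decays exponentially; each later stage is driven by the one
    before it. Returns one time series per stage.
    """
    z = [0.0] * n_stages
    z[0] = 1.0  # unit impulse at time 0
    history: list[list[float]] = [[] for _ in range(n_stages)]
    for _ in range(n_steps):
        for i in range(n_stages):
            history[i].append(z[i])
        new_z = z[:]
        new_z[0] = z[0] - dt * rate * z[0]
        for i in range(1, n_stages):
            new_z[i] = z[i] + dt * rate * (z[i - 1] - z[i])
        z = new_z
    return history


dt = 0.01
traces = cascade_impulse_response(n_stages=5, rate=1.0, dt=dt, n_steps=2000)
# Time at which each stage's trace is largest.
peak_times = [dt * max(range(len(tr)), key=tr.__getitem__) for tr in traces]
```

In the continuous-time version of this chain, stage `i` peaks at time `i / rate`, so the first stage (a plain exponential trace) is largest immediately after the event, while deeper stages respond most strongly to events a fixed delay in the past.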
Evolution and compression in LLMs: on the emergence of human-aligned categorization
Nathaniel Imel ⋅ Noga Zaslavsky
Converging evidence suggests that human systems of semantic categories achieve near-optimal compression via the Information Bottleneck (IB) complexity-accuracy tradeoff. Large language models (LLMs) are not trained for this objective, which raises the question: are LLMs capable of evolving efficient human-aligned semantic systems? To address this question, we focus on color categorization, a key testbed of cognitive theories of categorization with uniquely rich human data, and replicate two influential human studies with LLMs. First, we conduct an English color-naming study, showing that LLMs vary widely in their complexity and English-alignment, with larger instruction-tuned models achieving better alignment and IB-efficiency. Second, to test whether these LLMs simply mimic patterns in their training data or actually exhibit a human-like inductive bias toward IB-efficiency, we simulate cultural evolution of pseudo color-naming systems in LLMs via a method we refer to as Iterated in-Context Language Learning (IICLL). We find that, akin to humans, LLMs iteratively restructure initially random systems towards greater IB-efficiency. However, only the model with the strongest in-context capabilities (Gemini 2.0) is able to recapitulate the wide range of near-optimal IB-tradeoffs observed in humans, while other state-of-the-art models converge to low-complexity solutions. These findings demonstrate how human-aligned semantic categories can emerge in LLMs via the same fundamental principle that underlies semantic efficiency in humans.
Inferring brain plasticity rule under long-term stimulation with structured recurrent dynamics
Zhichao Liang ⋅ Jingzhe Lin ⋅ Xinyi Li ⋅ Guanyi Zhao ⋅ Quanying Liu
Understanding how long-term stimulation reshapes neural circuits requires uncovering the rules of brain plasticity. While short-term synaptic modifications have been extensively characterized, the principles that drive circuit-level reorganization across hours to weeks remain unknown. Here, we formalize these principles as a latent dynamical law that governs how recurrent connectivity evolves under repeated interventions. To capture this law, we introduce the Stimulus-Evoked Evolution Recurrent dynamics (STEER) framework, a dual-timescale model that disentangles fast neural activity from slow plastic changes. STEER represents plasticity as low-dimensional latent coefficients evolving under a learnable recurrence, enabling testable inference of plasticity rules rather than absorbing them into black-box parameters. We validate STEER with four benchmarks: synthetic Lorenz systems with controlled parameter shifts, BCM-based networks with biologically grounded plasticity, a task learning setting with adaptively optimized external stimulation, and longitudinal recordings from Parkinsonian rats receiving closed-loop DBS. Our results demonstrate that STEER recovers interpretable update equations, predicts network adaptation under unseen stimulation schedules, and supports the design of improved intervention protocols. By elevating long-term plasticity from a hidden confound to an identifiable dynamical object, STEER provides a data-driven foundation for both mechanistic insight and principled optimization of brain stimulation. The source code of this study is available at https://github.com/ncclab-sustech/STEER.git.
Estimating Dimensionality of Neural Representations from Finite Samples
Chanwoo Chun ⋅ Abdulkadir Canatar ⋅ SueYeon Chung ⋅ Daniel Lee
The global dimensionality of a neural representation manifold provides rich insight into the computational process underlying both artificial and biological neural networks. However, all existing measures of global dimensionality are sensitive to the number of samples, i.e., the number of rows and columns of the sample matrix. We show that, in particular, the participation ratio of eigenvalues, a popular measure of global dimensionality, is highly biased with small sample sizes, and propose a bias-corrected estimator that is more accurate with finite samples and with noise. On synthetic data examples, we demonstrate that our estimator can recover the true known dimensionality. We apply our estimator to neural brain recordings, including calcium imaging, electrophysiological recordings, and fMRI data, and to the neural activations in a large language model, and show that our estimator is invariant to the sample size. Finally, our estimators can additionally be used to measure the local dimensionalities of curved neural manifolds by weighting the finite samples appropriately.
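For reference, the participation ratio of covariance eigenvalues $\lambda_i$ is $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, which equals $\mathrm{tr}(C)^2 / \mathrm{tr}(C^2)$ for covariance matrix $C$. The sketch below computes this naive, uncorrected quantity from a sample matrix. It does not implement the bias-corrected estimator proposed in the abstract, and the function name `participation_ratio` and the row-per-sample layout are assumptions made for illustration.

```python
def participation_ratio(samples: list[list[float]]) -> float:
    """Naive (uncorrected) participation ratio of the sample covariance.

    `samples` holds one row per sample and one column per unit.
    Uses tr(C)^2 / tr(C^2), which avoids an eigendecomposition.
    """
    n = len(samples)
    d = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[a][a] for a in range(d))
    trace_of_square = sum(cov[a][b] * cov[b][a] for a in range(d) for b in range(d))
    return trace ** 2 / trace_of_square


# Points on a line occupy one dimension; an isotropic cross occupies two.
pr_line = participation_ratio([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
pr_cross = participation_ratio([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

With few noisy samples this plug-in value drifts away from the true dimensionality, which is the sample-size sensitivity the abstract sets out to correct.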
The Human Brain as a Dynamic Mixture of Expert Models in Video Understanding
Christina Sartzetaki ⋅ Anne Zonneveld ⋅ Pablo Oyarzo ⋅ Alessandro Gifford ⋅ Radoslaw Cichy ⋅ Pascal Mettes ⋅ Iris Groen
The human brain is the most efficient and versatile system for processing dynamic visual input. By comparing representations from deep video models to brain activity, we can gain insights into mechanistic solutions for effective video processing, important to better understand the brain and to build better models. Current works in model-brain alignment primarily focus on fMRI measurements, leaving open questions about fine-grained dynamic processing. Here, we introduce the first large-scale model benchmarking on alignment to dynamic electroencephalography (EEG) recordings of short natural videos. We analyze 100+ models across the axes of temporal integration, classification task, architecture, and pretraining, using our proposed Cross-Temporal Representational Similarity Analysis (CT-RSA) which matches the best time-unfolded model features to dynamically evolving brain responses, distilling $10^7$ alignment scores. Our findings reveal novel insights on how continuous visual input is integrated in the brain, beyond the standard temporal processing hierarchy from low to high-level representations. After initial alignment to hierarchical static object processing, responses in posterior electrodes best align to mid-level temporally-integrative action features, showing high temporal correspondence to feature timings. In contrast, responses in frontal electrodes best align with high-level static action representations and show no temporal correspondence to the video. Additionally, temporally-integrating state-space models show superior alignment to intermediate posterior activity, in which self-supervised pretraining is also beneficial. We draw a metaphor to a dynamic mixture of expert models for the changing neural preference in tasks and temporal integration reflected in the alignment to different model types across time. 
We posit that a single best-aligned model would need such training and architecture as to allow combining and dynamically switching between these capacities.
Towards Interpretable Visual Decoding with Attention to Brain Representations
Pinyuan Feng ⋅ Hossein Adeli ⋅ Wenxuan Guo ⋅ Fan Cheng ⋅ Ethan Hwang ⋅ Nikolaus Kriegeskorte
Recent work has demonstrated that complex visual stimuli can be decoded from human brain activity using deep generative models, offering new ways to probe how the brain represents real-world scenes. However, many existing approaches first map brain signals into intermediate image or text feature spaces before guiding the generative process, which obscures the contributions of different brain areas to the final reconstruction output. In this work, we propose NeuroAdapter, a visual decoding framework that directly conditions a latent diffusion model on brain representations, bypassing the need for intermediate feature spaces. Our method demonstrates competitive visual reconstruction quality on public fMRI datasets compared to prior work, while providing greater transparency into how brain signals drive visual reconstruction. To this end, we introduce an Image–Brain BI-directional interpretability framework (IBBI) that analyzes cross-attention patterns across diffusion denoising steps to reveal how different cortical areas influence the unfolding generative trajectory. Our work highlights the potential of end-to-end brain-to-image reconstruction and establishes a path for interpretable neural decoding.
CerebraGloss: Instruction-Tuning a Large Vision-Language Model for Fine-Grained Clinical EEG Interpretation
Wei Gu ⋅ Tianming Luo ⋅ Qiran Zhang ⋅ Mohan Ye ⋅ Xiao Shen ⋅ Wenxin Chen ⋅ Yunhuan Li ⋅ Yichen Zhang ⋅ Jing Hong ⋅ Bao-liang Lu ⋅ Wei-Long Zheng
Interpreting clinical electroencephalography (EEG) is a laborious, subjective process, and existing computational models are limited to narrow classification tasks rather than holistic interpretation. A key bottleneck for applying powerful Large Vision-Language Models (LVLMs) to this domain is the scarcity of datasets pairing EEG visualizations with fine-grained, expert-level annotations. We address this by introducing CerebraGloss, an instruction-tuned LVLM for nuanced EEG interpretation. We first introduce a novel, automated data generation pipeline, featuring a bespoke YOLO-based waveform detector, to programmatically create a large-scale corpus of EEG-text instruction data. Using this data, we develop CerebraGloss, the first model of its kind capable of unified, generative analysis—performing tasks from detailed waveform description to multi-turn, context-aware dialogue. To evaluate this new capability, we construct and release CerebraGloss-Bench, a comprehensive benchmark for open-ended EEG interpretation. CerebraGloss demonstrates strong performance, surpassing leading LVLMs, including proprietary models like GPT-5, on this benchmark and achieving a new state-of-the-art on the TUSZ seizure detection task. Models, benchmark and tools are available at https://github.com/iewug/CerebraGloss.
CogMoE: Signal-Quality–Guided Multimodal MoE for Cognitive Load Prediction
Aamir Shah ⋅ Yu Wen ⋅ Renjie Hu ⋅ Jiefu Chen ⋅ Jose Contreras-Vidal ⋅ Xuqing Wu ⋅ Xin Fu
The poor and variable quality of physiological signals fundamentally constrains reliable cognitive load (CL) prediction in real-world settings. In safety-critical tasks such as driving, degraded signal quality can severely compromise prediction accuracy, limiting the deployment of existing models outside controlled lab conditions. To address this challenge, we propose CogMoE, a signal-quality–guided Mixture-of-Experts (MoE) framework that dynamically adapts to heterogeneous and noisy inputs. CogMoE replaces conventional modality-based fusion with a quality-aware gating mechanism that integrates EEG, ECG, EDA, and gaze according to their estimated signal quality, shifting the basis of multimodal modeling from modality identity to signal quality. The framework operates in two stages: (1) quality-aware multimodal synchronization and recovery to mitigate artifacts, temporal misalignment, and missing data, and (2) signal-quality-specific expert modeling via a cross-modal MoE transformer that regulates information flow based on signal quality. To further improve stability, we introduce CORTEX Loss, which balances task accuracy, quality-aware representation refinement and expert utilization under noise. Experiments on CL-Drive and ADABase demonstrate that CogMoE outperforms strong baselines across all modality combinations and sequence lengths, consistently delivering improvements across diverse signal-quality conditions. Our code is publicly available at https://github.com/shahaamirbader/CogMoE.
Why do neurons encode information the way they do? Normative answers to this question model neural activity as the solution to an optimisation problem; for example, the celebrated efficient coding hypothesis frames neural activity as the optimal encoding of information under efficiency constraints. Successful normative theories have varied dramatically in complexity, from simple linear models (Atick & Redlich, 1990) to complex deep neural networks (Lindsay, 2021). What complex models gain in flexibility, they lose in tractability and often understandability. Here, we split the difference by constructing a set of tractable but flexible normative representational theories. Instead of optimising the neural activities directly, we follow Sengupta et al. (2018) and optimise the representational similarity, a matrix formed from the dot products of each pair of neural responses. Using this, we show that a large family of interesting optimisation problems are convex. This includes problems corresponding to linear and some non-linear neural networks, and problems from the literature not previously recognised as convex, such as modified versions of semi-nonnegative matrix factorisation or nonnegative sparse coding. We put these findings to work in two ways. First, we extend previous results on modularity and mixed selectivity in neural activity; in so doing we provide the first necessary and sufficient identifiability result for a form of semi-nonnegative matrix factorisation. Second, we seek to understand the meaningfulness of single neural tuning curves as compared to neural representations. In particular, we derive an identifiability result stating that, for an optimal representational similarity matrix, if neural tunings are "different enough" then they are uniquely linked to the optimal representational similarity, partially justifying the use of single neuron tuning analysis in neuroscience.
In sum, we identify an interesting space of convex problems, and use that to derive neural coding results.
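The representational similarity matrix described above can be sketched as a Gram matrix of dot products. This is a minimal illustration, assuming each row of `responses` is the population response to one stimulus; the function name `representational_similarity` and the toy data are hypothetical, and the convex optimisation over this matrix is not shown.

```python
def representational_similarity(responses: list[list[float]]) -> list[list[float]]:
    """Gram matrix: entry (i, j) is the dot product of responses i and j."""
    return [
        [sum(a * b for a, b in zip(r_i, r_j)) for r_j in responses]
        for r_i in responses
    ]


# Three hypothetical stimuli, each evoking a response in two neurons.
gram = representational_similarity([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
```

The matrix is symmetric by construction and is unchanged by any rotation of the neural axes, so it describes the geometry of the representation rather than the tuning of individual neurons.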
Setting up for failure: automatic discovery of the neural mechanisms of cognitive errors
Puria Radmard ⋅ Paul Bays ⋅ Máté Lengyel
Discovering the neural mechanisms underpinning cognition is one of the grand challenges of neuroscience. However, previous approaches for building models of recurrent neural network (RNN) dynamics that explain behaviour required iterative refinement of architectures and/or optimisation objectives, resulting in a piecemeal, and mostly heuristic, human-in-the-loop process. Here, we offer an alternative approach that automates the discovery of viable RNN mechanisms by explicitly training RNNs to reproduce behaviour, including the same characteristic errors and suboptimalities, that humans and animals produce in a cognitive task. Achieving this required two main innovations. First, as the amount of behavioural data that can be collected in experiments is often too limited to train RNNs, we use a non-parametric generative model of behavioural responses to produce surrogate data for training RNNs. Second, to capture all relevant statistical aspects of the data, rather than a limited number of hand-picked low-order moments as in previous moment-matching-based approaches, we developed a novel diffusion model-based approach for training RNNs. To showcase the potential of our approach, we chose a visual working memory task as our test-bed, as behaviour in this task is well known to produce response distributions that are patently multimodal (due to so-called swap errors). The resulting network dynamics correctly predicted previously reported qualitative features of neural data recorded in macaques. Importantly, these results were not possible to obtain with more traditional approaches, i.e., when only a limited set of behavioural signatures (rather than the full richness of behavioural response distributions) were fitted, or when RNNs were trained for task optimality (instead of reproducing behaviour). Our approach also yields novel predictions about the mechanism of swap errors, which can be readily tested in experiments. 
These results suggest that fitting RNNs to rich patterns of behaviour provides a powerful way to automatically discover the neural network dynamics supporting important cognitive functions.
A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Jingyu Lu ⋅ Haonan Wang ⋅ Qixiang ZHANG ⋅ Xiaomeng Li
Subject-agnostic brain decoding, which aims to reconstruct continuous visual experiences from fMRI without subject-specific training, holds great potential for clinical applications. However, this direction remains underexplored due to challenges in cross-subject generalization and the complex nature of brain signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a novel hierarchical decoding framework that explicitly models the ventral-dorsal architecture of the human visual system to learn multi-dimensional representations. By disentangling and leveraging features from early visual cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary cognitive information essential for visual reconstruction. Furthermore, we introduce a feature-level contrastive learning strategy to enhance the extraction of subject-invariant semantic representations, thereby enhancing subject-agnostic applicability to previously unseen subjects. Unlike conventional pipelines that need more than 12 hours of per-subject data and heavy computation, VCFlow sacrifices only 7\% accuracy on average yet generates each reconstructed video in 10 seconds without any retraining, offering a fast and clinically scalable solution.
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility
Michael Lepori ⋅ Jennifer Hu ⋅ Ishita Dasgupta ⋅ Roma Patel ⋅ Thomas Serre ⋅ Ellie Pavlick
Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality. In this work, we identify linear representations, which we call modal difference vectors, that discriminate between modal categories within a variety of LMs. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants' ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.
HEEGNet: Hyperbolic Embeddings for EEG
Shanglin Li ⋅ Chu Shiwen ⋅ Okan Koc ⋅ Yi Ding ⋅ Qibin Zhao ⋅ Motoaki Kawanabe ⋅ Ziheng Chen
Electroencephalography (EEG)-based brain-computer interfaces facilitate direct communication with a computer, enabling promising applications in human-computer interactions. However, their utility is currently limited because EEG decoding often suffers from poor generalization due to distribution shifts across domains (e.g., subjects). Learning robust representations that capture underlying task-relevant information would mitigate these shifts and improve generalization. One promising approach is to exploit the underlying hierarchical structure in EEG, as recent studies suggest that hierarchical cognitive processes, such as visual processing, can be encoded in EEG. Yet, most existing decoding methods rely on Euclidean embeddings, which are not well-suited for capturing hierarchical structures. In contrast, hyperbolic spaces, regarded as the continuous analogue of tree structures, provide a natural geometry for representing hierarchical data. In this study, we first demonstrate that EEG data exhibit hyperbolicity and show that hyperbolic embeddings improve generalization. Motivated by these findings, we propose HEEGNet, a hybrid hyperbolic network architecture to capture the hierarchical structure in EEG and learn domain-invariant hyperbolic embeddings. To this end, HEEGNet combines both Euclidean and hyperbolic encoders and employs a novel coarse-to-fine domain adaptation strategy. Extensive experiments on multiple public EEG datasets, covering visual evoked potentials, emotion recognition, and intracranial EEG, demonstrate that HEEGNet achieves state-of-the-art performance.
Discovering alternative solutions beyond the simplicity bias in recurrent neural networks
William Qian ⋅ Cengiz Pehlevan
Training recurrent neural networks (RNNs) to perform neuroscience-style tasks has become a popular way to generate hypotheses for how neural circuits in the brain might perform computations. Recent work has demonstrated that task-trained RNNs possess a strong simplicity bias. In particular, this inductive bias often causes RNNs trained on the same task to collapse on effectively the same solution, typically comprised of fixed-point attractors or other low-dimensional dynamical motifs. While such solutions are readily interpretable, this collapse proves counterproductive for the sake of generating a set of genuinely unique hypotheses for how neural computations might be performed. Here we propose Iterative Neural Similarity Deflation (INSD), a simple method to break this inductive bias. By penalizing linear predictivity of neural activity produced by standard task-trained RNNs, we find an alternative class of solutions to classic neuroscience-style RNN tasks. These solutions appear distinct across a battery of analysis techniques, including representational similarity metrics, dynamical systems analysis, and the linear decodability of task-relevant variables. Moreover, these alternative solutions can sometimes achieve superior performance in difficult or out-of-distribution task regimes. Our findings underscore the importance of moving beyond the simplicity bias to uncover richer and more varied models of neural computation.
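One way to read "penalizing linear predictivity" is as a penalty on how well a linear map from a candidate network's activity predicts the activity of a reference RNN. The sketch below measures this with an in-sample R² on synthetic data. The direction of the regression, the R² measure, and the function name `linear_predictivity` are all assumptions, not details taken from the abstract.

```python
import numpy as np

def linear_predictivity(source, target):
    """R^2 of the least-squares linear map (with bias) from source to target.

    Both arrays have shape (time_points, units).
    """
    design = np.hstack([source, np.ones((source.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ weights
    total = target - target.mean(axis=0)
    return 1.0 - np.sum(residual ** 2) / np.sum(total ** 2)

rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 5))          # stand-in for a standard RNN's activity
linear_copy = reference[:, ::-1] * 2.0 + 1.0   # permuted, scaled, shifted copy
unrelated = rng.normal(size=(200, 5))          # independent activity

r2_copy = linear_predictivity(linear_copy, reference)
r2_unrelated = linear_predictivity(unrelated, reference)
```

A training objective in this spirit would add a weighted term of this kind to the task loss, so that solutions linearly equivalent to the reference are discouraged.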
Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster
Pembe Gizem Özdil ⋅ Chuanfang Ning ⋅ Jasper Phelps ⋅ Sibo Wang-Chen ⋅ Guy Elisha ⋅ Auke Ijspeert ⋅ Pavan Ramdya
Computational models are critical for advancing our understanding of how neural, biomechanical, and physical systems interact to orchestrate animal behavior. Despite the availability of near-complete reconstructions of the Drosophila melanogaster central nervous system, musculature, and exoskeleton, anatomically and physically grounded models of fly leg muscles are still missing. Such models provide an indispensable bridge between motor neuron activity and joint movements. Here, we introduce the first 3D, data-driven musculoskeletal model of Drosophila legs, implemented in both OpenSim and MuJoCo simulation environments. Our model incorporates a Hill-type muscle representation based on high-resolution X-ray scans from multiple fixed specimens. We present a pipeline for constructing muscle models from morphological imaging data and for optimizing unknown muscle parameters specific to the fly. We then combine our musculoskeletal models with detailed 3D pose estimation data from behaving flies to achieve muscle-actuated behavioral replay in OpenSim. Simulations of muscle activity across diverse walking and grooming behaviors predict coordinated muscle synergies that can be tested experimentally. Furthermore, by training imitation learning policies in MuJoCo, we examine the effect of different passive joint properties on learning speed and find that damping and stiffness facilitate learning. Overall, our model enables the investigation of motor control in an experimentally tractable model organism, providing insight into how biomechanics contributes to the generation of complex limb movements. Moreover, it can be used to control embodied artificial agents to generate naturalistic and compliant locomotion in simulated environments.
Are EEG Foundation Models Worth It? Comparative Evaluation with Traditional Decoders in Diverse BCI Tasks
Liuyin Yang ⋅ Qiang Sun ⋅ Ang Li ⋅ Marc Van Hulle
Foundation models have recently emerged as a promising approach for learning generalizable EEG representations for brain–computer interfaces (BCIs). Yet, their true advantages over traditional methods—particularly classical non-neural approaches—remain unclear. In this work, we present a comprehensive benchmark of state-of-the-art EEG foundation models, evaluated across diverse datasets, decoding tasks, and six evaluation protocols, with rigorous statistical testing. We introduce spatiotemporal EEGFormer (ST-EEGFormer), a simple yet effective Vision Transformer (ViT)-based baseline, pre-trained solely with masked autoencoding (MAE) on over 8M EEG segments. Our results show that while fine-tuned foundation models perform well in data-rich, population-level settings, they often fail to significantly outperform compact neural networks or even classical non-neural decoders in data-scarce scenarios. Furthermore, linear probing remains consistently weak, and performance varies greatly across downstream tasks, with no clear scaling law observed among neural network decoders. These findings expose a substantial gap between pre-training and downstream fine-tuning, often diminishing the benefits of complex pre-training tasks. We further identify hidden architectural factors that affect performance and emphasize the need for transparent, statistically rigorous evaluation. Overall, this study calls for community-wide efforts to construct large-scale EEG datasets and for fair, reproducible benchmarks to advance EEG foundation models.
A foundation model with multi-variate parallel attention to generate neuronal activity
Francesco Carzaniga ⋅ Michael Hersche ⋅ Abu Sebastian ⋅ Kaspar Schindler ⋅ Abbas Rahimi
Learning from multi-variate time-series with heterogeneous channel configurations remains a fundamental challenge for deep neural networks, particularly in clinical domains such as intracranial electroencephalography (iEEG), where channel setups vary widely across subjects. In this work, we introduce multi-variate parallel attention (MVPA), a novel self-attention mechanism that disentangles content, temporal, and spatial attention, enabling flexible, generalizable, and efficient modeling of time-series data with varying channel counts and configurations. We use MVPA to build MVPFormer, a generative foundation model for human electrophysiology, trained to predict the evolution of iEEG signals across diverse subjects. To support this and future efforts by the community, we release the SWEC iEEG dataset, the largest publicly available iEEG dataset to date, comprising nearly 10,000 hours of recordings from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong generalization across subjects, demonstrating expert-level performance in several iEEG tasks. MVPFormer surpasses state-of-the-art (SOTA) Transformer baselines in seizure detection across the SWEC, the MAYO, and the FNUSA datasets, while also achieving SOTA performance on four Brain TreeBank iEEG decoding tasks (volume, pitch, onset, and speech). We further validate MVPA on standard time-series forecasting and classification tasks, where it matches or exceeds the performance of existing attention-based models. Together, our contributions establish MVPA as a general-purpose attention mechanism for heterogeneous time-series and MVPFormer as the first open-source, open-weights, and open-data iEEG foundation model with SOTA clinical performance. The code and weights are available at https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG dataset is available at https://huggingface.co/datasets/NeuroTec/SWECiEEGDataset.
Pitfalls in Evaluating Language Model Forecasters
Daniel Paleka ⋅ Shashwat Goel ⋅ Jonas Geiping ⋅ Florian Tramer
Large language models (LLMs) have recently been applied to forecasting tasks, with some works claiming these systems match or exceed human performance. In this paper, we argue that, as a community, we should be careful about such conclusions as evaluating LLM forecasters presents unique challenges. We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims. We argue that more rigorous evaluation methodologies are needed to confidently assess the forecasting abilities of LLMs.
Entropy-Guided Dynamic Tokens for Graph-LLM Alignment in Molecular Understanding
Zihao Jing ⋅ Qiuhao Zeng ⋅ Ruiyi Fang ⋅ Yan Sun ⋅ Boyu Wang ⋅ Pingzhao Hu
Molecular understanding is central to advancing areas such as scientific and drug discovery, yet Large Language Models (LLMs) struggle to understand molecular graphs effectively. Existing graph–LLM bridges often adapt the Q-Former-style connector with fixed-length static tokens, which was originally designed for vision tasks. These designs overlook stereochemistry and substructural context and typically require costly LLM-backbone fine-tuning, limiting efficiency and generalization. We introduce EDT-Former, an Entropy-guided Dynamic Token Transformer that generates tokens aligned with informative molecular patches, thereby preserving both local and global structural features for molecular graph understanding. Beyond prior approaches, EDT-Former enables alignment between frozen graph encoders and LLMs without tuning the LLM backbone (excluding the embedding layer), resulting in computationally efficient fine-tuning, and achieves state-of-the-art results on MoleculeQA, Mol-Instructions, and property prediction benchmarks (TDC, MoleculeNet), underscoring its effectiveness for scalable and generalizable multimodal molecular understanding.
xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
Daniel Beaglehole ⋅ David Holzmüller ⋅ Adityanarayanan Radhakrishnan ⋅ Misha Belkin
Inference from tabular data, collections of continuous and categorical variables organized into matrices, is a foundation for modern technology and science. Yet, in contrast to the explosive changes in the rest of AI, the best practice for these predictive tasks has been relatively unchanged and is still primarily based on variations of Gradient Boosted Decision Trees (GBDTs). Very recently, there has been renewed interest in developing state-of-the-art methods for tabular data based on recent developments in neural networks and feature learning methods. In this work, we introduce xRFM, an algorithm that combines feature learning kernel machines with a tree structure to both adapt to the local structure of the data and scale to essentially unlimited amounts of training data. We show that compared to 31 other methods, including recently introduced tabular foundation models (TabPFN-v2) and GBDTs, xRFM achieves the best performance across 100 regression datasets and is competitive with the best methods across 200 classification datasets, outperforming GBDTs. Additionally, xRFM provides interpretability natively through the Average Gradient Outer Product.
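The Average Gradient Outer Product (AGOP) of a predictor f over inputs x_1, ..., x_n is the matrix (1/n) Σ ∇f(x_i) ∇f(x_i)ᵀ. The sketch below computes it for a toy predictor using finite differences. The predictor, the finite-difference gradients, and the helper names are illustrative assumptions and are not xRFM's implementation.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        grad[j] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def agop(f, inputs):
    """Average of the outer products of per-sample gradients."""
    grads = np.stack([numerical_gradient(f, x) for x in inputs])
    return grads.T @ grads / len(inputs)

def toy_predictor(x):
    # Depends only on features 0 and 1; features 2 and 3 are irrelevant.
    return 3.0 * x[0] + x[1] ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
M = agop(toy_predictor, X)
importance = np.diag(M)  # per-feature sensitivity of the predictor
```

The diagonal of the AGOP can be read as a feature-importance score: here the entries for the two irrelevant features are zero.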
Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search
Zhiyu Mou ⋅ Yiqin Lv ⋅ Miao Xu ⋅ Qi Wang ⋅ Yixiu Mao ⋅ Jinghao Chen ⋅ Qichen Ye ⋅ Chao Li ⋅ Rongquan Bai ⋅ Chuan Yu ⋅ Jian Xu ⋅ Bo Zheng
Auto-bidding is a critical tool for advertisers to improve advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static dataset with feedback. To address this, we propose AIGB-Pearl (Planning with EvaluAtor via RL), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator to assess the quality of generated scores and designing a provably sound KL-Lipschitz-constrained score-maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm that incorporates the synchronous coupling technique is further developed to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.
SK2Decompile: LLM-based Two-Phase Binary Decompilation from Skeleton to Skin
hanzhuo Tan ⋅ Weihao Li ⋅ Xiaolong Tian ⋅ Siyi Wang ⋅ Jiaming Liu ⋅ Jing Li ⋅ Yuqun Zhang
Large Language Models (LLMs) have emerged as a promising approach for binary decompilation. However, existing LLM-based decompilers are still somewhat limited in effectively presenting a program's source-level structure with its original identifiers. To mitigate this, we introduce SK2Decompile, a novel two-phase approach to decompile from the skeleton (semantic structure) to the skin (identifier) of programs. Specifically, we first apply a Structure Recovery model to translate a program's binary code to an Intermediate Representation (IR), deriving the program's "skeleton", i.e., preserving control flow and data structures while obfuscating all identifiers with generic placeholders. We also apply reinforcement learning to reward the model for producing program structures that adhere to the syntactic and semantic rules expected by compilers. Second, we apply an Identifier Naming model to produce meaningful identifiers that reflect actual program semantics, deriving the program's "skin". We train the Identifier Naming model with a separate reinforcement learning objective that rewards the semantic similarity between its predictions and the reference code. Such a two-phase decompilation process makes it possible to improve the correctness and the readability of decompilation independently. Our evaluations indicate that SK2Decompile significantly outperforms the SOTA baselines, achieving a 21.6% average re-executability rate gain over GPT-5-mini on the HumanEval dataset and a 29.4% average R2I improvement over Idioms on the GitHub2025 benchmark.
When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
Zhishang Xiang ⋅ Chuanjie Wu ⋅ Qinggang Zhang ⋅ Shengyuan Chen ⋅ Zijin Hong ⋅ Xiao Huang ⋅ Jinsong Su
Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning. Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph construction and knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions under which GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analysis are collected for the community at https://github.com/GraphRAG-Bench/GraphRAG-Benchmark.
ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity
Xiaoyang Liu ⋅ Tao Zhu ⋅ Zineng Dong ⋅ Yuntian Liu ⋅ Guo qingfeng ⋅ Liu ZhaoXuan ⋅ Yu Chen ⋅ Tao Luo
Despite significant strides in statement autoformalization, a critical gap remains in the development of automated evaluation metrics capable of assessing formal translation quality. Existing metrics often fail to balance semantic and structural information: string-based methods neglect semantics, whereas proof-based approaches offer no graded similarity when proofs fail. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which captures syntactic structure by transforming formal statements into operator trees and computes a real-valued similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric by incorporating semantic transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a benchmark comprising 1,247 expert-annotated formal statement pairs derived from miniF2F and ProofNet, distinctively labeled for both semantic provability and structural likeness. Experiments on the EPLA benchmark demonstrate that TransTED Similarity surpasses existing methods, achieving state-of-the-art accuracy and Kappa score. The benchmark dataset, code, and detailed experimental results are available at https://github.com/XiaoyangLiu-sjtu/ASSESS.
How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective
Xianzhen Luo ⋅ JinYang Huang ⋅ Wenzhen Zheng ⋅ Qingfu Zhu ⋅ Mingzheng Xu ⋅ Yiheng Xu ⋅ YuanTao Fan ⋅ Wanxiang Che
Evaluating test cases automatically generated by Large Language Models (LLMs) is a critical yet challenging task. Existing benchmarks often evaluate the exclusion ratio on large, unstructured collections of wrong codes, suffering from high computational costs and score inflation. Furthermore, they inadvertently reward generators that detect common, trivial bugs, while failing to penalize their inability to identify rare yet critical faults. In this work, we connect two fundamental questions: (1) What is the minimal set of wrong codes sufficient to represent the entire error space? and (2) What is the minimal set of test cases needed to distinguish them? We introduce a novel framework that formalizes benchmark construction as finding an optimal diagnostic basis in a binary code-test matrix, where rows represent wrong codes and columns represent test case results. The rank of this matrix specifies the minimal number of independent error patterns (wrong codes) and provides a tight upper bound on the number of test cases required for complete fault coverage. Our objective is to identify a basis of size equal to the matrix rank that maximizes internal diversity. To tackle this NP-hard problem, we propose WrongSelect, an efficient approximation algorithm to select maximally diverse wrong codes. Applying this framework to millions of competitive programming submissions, we construct TC-Bench, a compact, diverse, and inflation-resistant benchmark. Extensive experiments show that even the most advanced test case generation methods achieve only ~60% exclusion rates on TC-Bench, exposing a significant gap in their diagnostic power and highlighting substantial room for future improvement. Our dataset is available at: https://huggingface.co/datasets/Luoberta/TC-Bench and our code is at: https://github.com/Luowaterbi/TC-Bench.
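The binary-matrix view can be illustrated on a toy example. The sketch below builds a small code-test matrix, computes its rank, and picks a first-fit set of linearly independent rows. The abstract does not say over which field the rank is taken, so the real-valued rank from NumPy is an assumption. `greedy_basis` is a hypothetical helper and is not WrongSelect, since it ignores diversity entirely.

```python
import numpy as np

# Rows: wrong codes. Columns: test cases. 1 means the test exposes the bug.
code_test = np.array([
    [1, 0, 0, 1],
    [1, 0, 0, 1],  # same failure pattern as row 0
    [0, 1, 0, 0],
    [1, 1, 0, 1],  # sum of rows 0 and 2, so not independent
    [0, 0, 1, 0],
])

def greedy_basis(matrix):
    """First-fit selection of rows that are linearly independent (over the reals)."""
    chosen = []
    for i in range(matrix.shape[0]):
        candidate = chosen + [i]
        if np.linalg.matrix_rank(matrix[candidate]) == len(candidate):
            chosen.append(i)
    return chosen

rank = np.linalg.matrix_rank(code_test)
distinct_patterns = np.unique(code_test, axis=0)
basis_rows = greedy_basis(code_test)
```

In this toy matrix there are four distinct failure patterns but only three independent ones, so a basis of three wrong codes already spans the observed error space.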
Trade in Minutes! Rationality-Driven Agentic System for Quantitative Financial Trading
Zifan Song ⋅ Kaitao Song ⋅ Guosheng Hu ⋅ Ding Qi ⋅ Junyao Gao ⋅ Xiaohua Wang ⋅ Dongsheng Li ⋅ Cai Zhao
Recent advancements in large language models (LLMs) and agentic systems have shown exceptional decision-making capabilities, revealing significant potential for autonomic finance. Current financial trading agents predominantly simulate anthropomorphic roles that inadvertently introduce emotional biases and rely on peripheral information, while being constrained by the necessity for continuous inference during deployment. In this paper, we pioneer the harmonization of strategic depth in agents with the mechanical rationality essential for quantitative trading. Consequently, we present TiMi (Trade in Minutes), a rationality-driven multi-agent system that architecturally decouples strategy development from minute-level deployment. TiMi leverages specialized LLM capabilities of semantic analysis, code programming, and mathematical reasoning within a comprehensive policy-optimization-deployment chain. Specifically, we propose a two-tier analytical paradigm from macro patterns to micro customization, layered programming design for trading bot implementation, and closed-loop optimization driven by mathematical reflection. Extensive evaluations across 200+ trading pairs in stock and cryptocurrency markets empirically validate the efficacy of TiMi in stable profitability, action efficiency, and risk control under volatile market dynamics.
Experience-based Knowledge Correction for Robust Planning in Minecraft
Seungjoon Lee ⋅ Suhwan Kim ⋅ Minhyeon Oh ⋅ Youngsik Yoon ⋅ Jungseul Ok
Large Language Model (LLM)-based planning has advanced embodied agents in long-horizon environments such as Minecraft, where acquiring latent knowledge of goal (or item) dependencies and feasible actions is critical. However, LLMs often begin with flawed priors and fail to correct them through prompting, even with feedback. We present XENON (eXpErience-based kNOwledge correctioN), an agent that algorithmically revises knowledge from experience, enabling robustness to flawed priors and sparse binary feedback. XENON integrates two mechanisms: Adaptive Dependency Graph, which corrects item dependencies using past successes, and Failure-aware Action Memory, which corrects action knowledge using past failures. Together, these components allow XENON to acquire complex dependencies despite limited guidance. Experiments across multiple Minecraft benchmarks show that XENON outperforms prior agents in both knowledge learning and long-horizon planning. Remarkably, with only a 7B open-weight LLM, XENON surpasses agents that rely on much larger proprietary models.
SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG
Xuechen Zhang ⋅ Koustava Goswami ⋅ Samet Oymak ⋅ Jiasi Chen ⋅ Nedim Lipka
Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static chunking and flat retrieval: documents are split into short, predetermined, fixed-size chunks, embeddings are retrieved uniformly, and generation relies on whatever chunks are returned. This design brings challenges, as retrieval quality is highly sensitive to chunk size, often introduces noise from irrelevant or misleading chunks, and scales poorly to large corpora. We present SmartChunk retrieval, a query-adaptive framework for efficient and robust long-document question answering (QA). SmartChunk uses (i) a planner that predicts the optimal chunk abstraction level for each query, and (ii) a lightweight compression module that produces high-level chunk embeddings without repeated summarization. By adapting retrieval granularity on the fly, SmartChunk balances accuracy with efficiency and avoids the drawbacks of fixed strategies. Notably, our planner can reason about chunk abstractions through a novel reinforcement learning scheme, STITCH, which boosts accuracy and generalization. To reflect real-world applications, where users face diverse document types and query styles, we evaluate SmartChunk on five QA benchmarks plus one out-of-domain dataset. Across these evaluations, SmartChunk outperforms state-of-the-art RAG baselines, while reducing cost. Further analysis demonstrates strong scalability with larger corpora and consistent gains on out-of-domain datasets, highlighting its effectiveness as a general framework for adaptive retrieval.
HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatrics
Shufan Jiang ⋅ Sizhou Chen ⋅ Chi Zhang ⋅ Xiao-Lei Zhang ⋅ Xuelong Li
Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language models (LLMs) provides a new path to achieve this goal. However, existing LLM-based drama generation methods often produce models that lack initiative and cannot interact with the physical scene, while typically requiring detailed user input that diminishes the immersion of live performance. To address these challenges, we propose HAMLET, a hierarchical adaptive multi-agent framework focused on drama creation and real-time online performance. Given a simple topic, the framework first generates a narrative blueprint to guide the subsequent improvisational performance. In the online performance phase, each actor is equipped with an adaptive reasoning module that enables decision-making based on their personas, memories, goals, and emotional states during complex group chat scenarios. Beyond dialogue, actor agents engage in embodied interactions by changing the state of scene props through actions such as opening a letter or picking up a weapon, which are broadcast to update the global environmental context. To objectively assess the quality of live embodied theatrics, we establish a comprehensive evaluation method and introduce HAMLETJudge, a specialized critic model for automated evaluation. Experimental results demonstrate that HAMLET excels in creating expressive, coherent, and physically interactive theatrical experiences in an autonomous manner.
CoDA: Agentic Systems for Collaborative Data Visualization
Zichen Chen ⋅ Jiefeng Chen ⋅ Sercan Arik ⋅ Misha Sra ⋅ Tomas Pfister ⋅ Jinsung Yoon
Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations, highlighting the need for robust automation from natural language queries. However, current systems struggle with complex datasets containing multiple files and iterative refinement. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task, focusing on initial query parsing while failing to robustly manage data complexity, code errors, or final visualization quality. In this paper, we reframe this challenge as a collaborative multi-agent problem. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection. We formalize this pipeline, demonstrating how metadata-focused analysis bypasses token limits and quality-driven refinement ensures robustness. Extensive evaluations show CoDA achieves substantial gains in the overall score, outperforming competitive baselines by up to 41.5%. This work demonstrates that the future of visualization automation lies not in isolated code generation but in integrated, collaborative agentic workflows.
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Zisheng Chen ⋅ Chunwei Wang ⋅ Runhui Huang ⋅ Hongbin Xu ⋅ Xiuwei Chen ⋅ Jun Zhou ⋅ Jianhua Han ⋅ Hang Xu ⋅ Xiaodan Liang
In this paper, we introduce SemHiTok, a unified image Tokenizer via Semantic Guided Hierarchical codebook that provides consistent discrete representations for multimodal understanding and generation. Recently, unified image tokenizers, which are designed to capture high-level semantic features for understanding and to retain low-level pixel features for generation, have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining losses for semantic distillation and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a novel semantic-guided hierarchical codebook, which builds pixel sub-codebooks on a pretrained semantic codebook. This design decouples semantic and pixel features in terms of structure and training strategy, enabling the tokenizer to capture pixel features while retaining its ability to comprehend high-level semantic information. Our experiments demonstrate that SemHiTok achieves leading performance in image reconstruction and multimodal understanding under the LLaVA-v1.5 setting. Further, we develop a unified MLLM with SemHiTok, which exhibits superior performance across multimodal understanding and generation tasks. Extensive experiments confirm our analysis, showing that our unified image tokenizer architecture achieves a better trade-off.
WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Chunyang Li ⋅ Yilun Zheng ⋅ Xinting Huang ⋅ Tianqing Fang ⋅ Jiahao Xu ⋅ Lihui Chen ⋅ Yangqiu Song ⋅ Winston Hu
The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different paradigms and guidance mechanisms. Our experiments reveal a significant gap between LLM judges and human experts. In-depth analysis indicates this gap stems from fundamental model limitations, including failures in recognizing functional equivalence, verifying task feasibility, and mitigating bias. Overall, WebDevJudge presents a significant challenge to LLM-as-a-judge, offering insights to guide future research toward developing more reliable and capable automated evaluators for complicated scenarios.
Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
Shreyas Singh ⋅ Kunal Singh ⋅ Pradeep Moturi
Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch Agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a ∼5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over search trajectory breadth, depth, and horizon. These improvements enable reliable extension of tool-calling beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category and closely rivals proprietary closed systems, while also demonstrating strong performance on general reasoning benchmarks: HLE, AIME-25, GPQA-Diamond, and MedQA.
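For context on the GRPO baseline that RAPO extends, the sketch below shows the standard group-relative advantage: rewards for several rollouts of the same prompt are normalized by the group mean and standard deviation. It does not implement RAPO's curriculum pruning, advantage scaling, or replay buffers, which the abstract does not specify. The function name and the `eps` constant are assumptions.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Two correct and two incorrect rollouts give a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout gets the same verifiable reward, all advantages vanish.
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The second case shows why groups with uniform rewards contribute no gradient under this baseline, which is one plausible motivation for stabilization techniques in multi-turn settings.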
Lean Finder: Semantic Search for Mathlib That Understands User Intents
Jialin Lu ⋅ Kye Emond ⋅ Kaiyu Yang ⋅ Swarat Chaudhuri ⋅ Weiran Sun ⋅ Wuyang Chen
We present Lean Finder, a semantic search engine for Lean and mathlib that understands and aligns with the intents of mathematicians. Progress in formal theorem proving is often hindered by the difficulty of locating relevant theorems and the steep learning curve of the Lean 4 language, making advancement slow and labor-intensive. Existing Lean search engines, though helpful, rely primarily on informalizations (natural language translation of the formal statements), while largely overlooking the mismatch with real-world user queries. In contrast, we propose a user-centered semantic search tailored to the needs of mathematicians. Our approach begins by analyzing and clustering the semantics of public Lean discussions, then fine-tuning text embeddings on synthesized queries that emulate user intents. We further align Lean Finder with mathematicians’ preferences using diverse feedback signals, encoding it with a rich awareness of their goals from multiple perspectives. Evaluations on real-world queries, informalized statements, and proof states demonstrate that our Lean Finder achieves over 30% relative improvement compared to previous search engines and GPT-4o. In addition, Lean Finder is compatible with LLM-based theorem provers, bridging retrieval with formal reasoning. Lean Finder is available at: https://leanfinder.github.io.
Control Tax: The Price of Keeping AI in Check
Mikhail Terekhov ⋅ Zhen Liu ⋅ Caglar Gulcehre ⋅ Samuel Albanie
The rapid integration of agentic AI into high-stakes real-world applications requires robust oversight mechanisms. The emerging field of AI Control (AIC) aims to provide such an oversight mechanism, but practical adoption depends heavily on implementation overhead. To study this problem better, we introduce the notion of Control Tax: the operational and financial cost of integrating control measures into AI pipelines. Our work makes three key contributions to the field of AIC: (1) we introduce a theoretical framework that quantifies the Control Tax and maps classifier performance to safety assurances; (2) we conduct comprehensive evaluations of state-of-the-art language models in adversarial settings, where attacker models insert subtle backdoors into code while monitoring models attempt to detect these vulnerabilities; and (3) we provide empirical financial cost estimates for control protocols and develop optimized monitoring strategies that balance safety and cost-effectiveness while accounting for practical constraints like auditing budgets. Our framework enables practitioners to make informed decisions by systematically connecting safety guarantees with their costs, advancing AIC through principled economic feasibility assessment across different deployment contexts.
Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval
Junwei Lan ⋅ Jianlyu Chen ⋅ Zheng Liu ⋅ Chaofan Li ⋅ Siqi Bao ⋅ Defu Lian
With the growing popularity of LLM agents and RAG, it has become increasingly important to retrieve documents that are essential for solving a task, even when their connection to the task is indirect or implicit. Addressing this problem requires fine-grained reasoning to accurately assess the relevance between the task and each candidate document. This capability, however, poses a significant challenge for existing IR techniques. Despite recent progress in reasoning-enhanced IR, existing approaches still face significant challenges in applicability, scalability, and efficiency. In this work, we propose Retro*, a novel approach for reasoning-intensive document retrieval. Our method introduces a rubric-based relevance scoring mechanism, enabling the model to reason about the relationship between a task and a document based on explicitly defined criteria, thereby producing a fine-grained, interpretable relevance score. Retro* also supports test-time scaling by combining multiple reasoning trajectories via score integration, which produces more reliable relevance estimates. To optimize Retro*'s reasoning capabilities, we introduce a novel reinforcement learning algorithm tailored for its relevance scoring mechanism, which employs two composite rewards to fully exploit the trajectories of each training sample. Our experiments show that Retro* outperforms existing document retrieval methods with notable advantages, leading to state-of-the-art performance on the BRIGHT benchmark.
LoC-Decomp: LLM Autoformalization via Logical Concept Decomposition and Iterative Feedback Correction
jiangze shi ⋅ Zhiwei Zhang ⋅ Baoquan Ma ⋅ Shuai Zhao ⋅ Ye Yuan ⋅ Guoren Wang
Autoformalization, the process of converting natural language mathematical statements into machine-verifiable formal code, plays a critical role in ensuring the reliability of mathematical reasoning generated by large language models (LLMs). Recent studies show that LLMs exhibit strong potential in automating this process, producing formal code for systems such as Lean 4, Coq, and Isabelle. Despite prominent advances, existing LLM-based autoformalization methods remain limited: they lack the ability to provide reliable semantic consistency checks to ensure that the formal code accurately preserves the meaning of the original statement. Furthermore, such methods are unable to support iterative improvement through corrective feedback. To address these limitations, we propose LoC-Decomp, a novel framework that integrates an automatic semantic consistency checker and the Lean 4 compiler to iteratively refine LLM-generated formalizations, ensuring both semantic consistency and syntactic correctness. Our approach introduces three key innovations: (1) A structured and CoT-like formalization template that decomposes complex formalization tasks into modular, foundational components, and systematically assembles them, like building blocks, into a complete formal expression. (2) A semantic self-checking mechanism based on a divide-conquer-merge strategy to detect subtle inconsistencies between the formalization and the original statement. (3) An iterative feedback-driven refinement loop that leverages both semantic and syntactic error signals to guide the LLM in progressively improving the formal output. By integrating these innovations, LoC-Decomp significantly enhances the accuracy of LLM-driven formalization, reduces reliance on human intervention, and moves closer to truly reliable automated reasoning.
Extensive experiments on high-school-level and undergraduate-level datasets demonstrate that our approach achieves a significantly higher formalization success rate compared to baseline methods and state-of-the-art (SOTA) models. On the PutnamBench dataset, for instance, our method attains a success rate of 93.09%, representing an improvement of 18 percentage points over the previous SOTA SFT-based model.
Persona Features Control Emergent Misalignment
Miles Wang ⋅ Tom Dupre la Tour ⋅ Olivia Watkins ⋅ Aleksandar Makelov ⋅ Ryan Chi ⋅ Samuel Miserendino ⋅ Jeffrey Wang ⋅ Achyuta Rajaram ⋅ Johannes Heidecke ⋅ Tejal Patwardhan ⋅ Daniel Mossing
Understanding how language models generalize behaviors from their training to a broader deployment distribution is an important problem in AI safety. Betley et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes "emergent misalignment," where models give stereotypically malicious responses to unrelated prompts. We extend this work, demonstrating emergent misalignment across diverse conditions, including reinforcement learning on reasoning models, fine-tuning on various synthetic datasets, and in models without safety training. To investigate the mechanisms behind this generalized misalignment, we apply a "model diffing" approach using sparse autoencoders to compare internal model representations before and after fine-tuning. This approach reveals several "misaligned persona" features in activation space, including a toxic persona feature which most strongly controls emergent misalignment and can be used to predict whether a model will exhibit such behavior. Additionally, we investigate mitigation strategies, discovering that fine-tuning an emergently misaligned model on just a few hundred benign samples efficiently restores alignment.
Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
Shaokun Zhang ⋅ Yi Dong ⋅ Jieyu Zhang ⋅ Jan Kautz ⋅ Bryan Catanzaro ⋅ Andrew Tao ⋅ Qingyun Wu ⋅ Zhiding Yu ⋅ Guilin Liu
Enabling large language models with external tools has become a pivotal strategy for extending their functionality beyond text space. To enhance LLMs' tool-calling abilities, previous approaches primarily rely on supervised fine-tuning (SFT) with trajectories distilled from stronger models, often resulting in imitative reasoning that limits generalization. In this work, we explore rule-based reinforcement learning to enhance tool-calling in LLMs, resulting in Nemotron-Research-Tool-N1, a series of tool-calling reasoning models. Rather than enforcing supervision over intermediate distilled reasoning traces, Tool-N1 is trained with a binary RL reward that assesses only the format validity and functional correctness of tool invocations. This lightweight supervision allows the model to develop reasoning strategies independently, without relying on annotated trajectories. Experiments on several major benchmarks show that Tool-N1-7B/14B clearly outperform GPT-4o. We conduct a systematic study on the design of rule-based reinforcement learning strategies for training tool-calling models. Using 5,518 distilled reasoning trajectories, we compare SFT, RL, and the SFT-then-RL pipeline, finding that the widely adopted SFT-then-RL paradigm does not necessarily outperform pure RL.
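The binary reward described above can be illustrated with a minimal sketch. It assumes a hypothetical output format in which the model wraps a JSON list of calls in `<tool_call>` tags; the tag name, the function name `binary_tool_reward`, and the order-insensitive comparison are illustrative assumptions, not details taken from the paper.

```python
import json


def binary_tool_reward(output, expected_calls):
    """Return 1 only if the output is well formed AND the calls are correct.

    Hypothetical format: exactly one <tool_call>...</tool_call> span
    containing a JSON list of {"name": ..., "arguments": ...} objects.
    """
    start, end = "<tool_call>", "</tool_call>"
    # Format validity: exactly one opening and one closing tag.
    if output.count(start) != 1 or output.count(end) != 1:
        return 0
    body = output.split(start, 1)[1].split(end, 1)[0]
    try:
        calls = json.loads(body)
    except json.JSONDecodeError:
        return 0
    if not isinstance(calls, list):
        return 0

    # Functional correctness: same set of calls, ignoring call order
    # and key order (an assumption made for this sketch).
    def canonical(items):
        return sorted(json.dumps(item, sort_keys=True) for item in items)

    return int(canonical(calls) == canonical(expected_calls))
```

No credit is given for the reasoning text outside the tags, which mirrors the idea that only the format and the final tool invocation are supervised.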
Reducing Semantic Mismatch in Brain-to-Text Decoding Through Personalized Multimodal Masking
Jiaxuan Chen ⋅ Yu Qi ⋅ Yueming Wang ⋅ Gang Pan
The rapid progress of large vision-language models (VLMs), such as CLIP, has spurred the development of a wide range of neural decoding frameworks. Nevertheless, most existing approaches still suffer from semantic mismatches during representational alignment. This challenge may stem from the fact that the human brain does not distribute attention uniformly across a visual scene, but rather selectively encodes salient or relevant regions. Moreover, such selectivity is closely related to individual interests and varies from person to person. To address this challenge, we propose Yo'Mind, a novel optimal transport (OT)-driven personalized multimodal semantic masking framework designed to bridge the semantic gap between brain and machines in interpreting visual scenes. Technically, Yo'Mind introduces a dynamic semantic pruning and allocation mechanism that adaptively masks redundant visual semantic components in stimulus images based on individual neural responses—without requiring extra human supervision or hyperparameter tuning. This strategy can be used to enhance semantic consensus between brain and machine representations during decoding. Furthermore, the inherent flexibility of OT theory enables Yo'Mind to perform brain-visual-linguistic alignment and cross-subject decoding within a unified end-to-end architecture. Extensive experiments demonstrate that our Yo'Mind offers several advantages, including state-of-the-art brain-to-text reconstruction performance and improved interpretability of the decoding process.
Reconciling Visual Perception and Generation in Diffusion Models
Liulei Li ⋅ Yi Yang ⋅ Wenguan Wang
We present GenRep, a unified image understanding and synthesis model that jointly conducts discriminative learning and generative modeling in one training session. By leveraging Monte Carlo approximation, GenRep distills distributional knowledge embedded in diffusion models to guide the discriminative learning for visual perception tasks. Simultaneously, a semantic-driven image generation process is established, where high-level semantics learned from perception tasks can be used to inform image synthesis, creating a positive feedback loop for mutual boosts. Moreover, to reconcile the learning process for both tasks, a gradient alignment strategy is proposed to symmetrically modify the optimization directions of perception and generation losses. These designs empower GenRep to be a versatile and powerful model that achieves leading performance on both image understanding and generation benchmarks.
SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
Xiaoxuan He ⋅ Siming Fu ⋅ Wanli Li ⋅ Zhiyuan Li ⋅ Dacheng Yin ⋅ Kang Rong ⋅ Fengyun Rao ⋅ Bo Zhang
Aligning diffusion models with human preferences remains challenging, particularly when reward models are unavailable or impractical to obtain, and collecting large-scale preference datasets is prohibitively expensive. This raises a fundamental question: can we achieve effective alignment using only minimal human feedback, without auxiliary reward models, by unlocking the latent capabilities within diffusion models themselves? In this paper, we propose SAIL (Self-Amplified Iterative Learning), a novel framework that enables diffusion models to act as their own teachers through iterative self-improvement. Starting from a minimal seed set of human-annotated preference pairs, SAIL operates in a closed-loop manner where the model progressively generates diverse samples, self-annotates preferences based on its evolving understanding, and refines itself using this self-augmented dataset. To ensure robust learning and prevent catastrophic forgetting, we introduce a ranked preference mixup strategy that carefully balances exploration with adherence to initial human priors. Extensive experiments demonstrate that SAIL consistently outperforms state-of-the-art methods across multiple benchmarks while using merely 6% of the preference data required by existing approaches, revealing that diffusion models possess remarkable self-improvement capabilities that, when properly harnessed, can effectively replace both large-scale human annotation and external reward models.
Activation Steering with a Feedback Controller
Dung Viet Nguyen ⋅ Yen Pham ⋅ Hieu Vu ⋅ Lei Zhang ⋅ Tan Nguyen
Controlling the behaviors of large language models (LLMs) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
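The roles of the three terms can be seen in a minimal sketch of a discrete PID loop over a toy residual stream. The function name `pid_steer`, the gains, and the choice of a scalar error measured along a single steering direction are illustrative assumptions; the abstract does not specify the authors' exact formulation.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def pid_steer(h0, layer_updates, direction, target, kp=0.5, ki=0.1, kd=0.05):
    """Steer a toy residual stream toward `target` along `direction`.

    h0            -- initial activation vector
    layer_updates -- per-layer updates the (toy) model would add on its own
    direction     -- unit-norm steering direction (assumed given)
    target        -- desired projection of the activation onto `direction`
    Returns the final activation and the error observed at each layer.
    """
    h = list(h0)
    integral, prev_err = 0.0, None
    errors = []
    for delta in layer_updates:
        err = target - dot(h, direction)            # P: current misalignment
        integral += err                             # I: accumulated error
        deriv = 0.0 if prev_err is None else err - prev_err  # D: error change
        u = kp * err + ki * integral + kd * deriv   # scalar control signal
        # Apply the layer's own update plus the correction along `direction`.
        h = [x + dx + u * d for x, dx, d in zip(h, delta, direction)]
        prev_err = err
        errors.append(err)
    return h, errors
```

Setting `ki=0` and `kd=0` leaves a purely proportional correction, that is, adding a scaled steering vector at every layer, which is the special case the abstract associates with existing steering methods.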
Quadratic Direct Forecast for Training Multi-Step Time-Series Forecast Models
Hao Wang ⋅ Licheng Pan ⋅ Yuan Lu ⋅ Zi Chan ⋅ Tianqiao Liu ⋅ Shuting He ⋅ Zhixuan Chu ⋅ Qingsong Wen ⋅ Haoxuan Li ⋅ Zhouchen Lin
The design of the training objective is central to training time-series forecasting models. Existing training objectives such as mean squared error mostly treat each future step as an independent, equally weighted task, which we find leads to two issues: (1) they overlook the label autocorrelation effect among future steps, leading to a biased training objective; (2) they fail to set heterogeneous task weights for the forecasting tasks corresponding to different future steps, limiting forecasting performance. To fill this gap, we propose a novel quadratic-form weighted training objective that addresses both issues simultaneously. Specifically, the off-diagonal elements of the weighting matrix account for the label autocorrelation effect, whereas the non-uniform diagonals are expected to match the most preferable weights of the forecasting tasks with varying future steps. To achieve this, we propose a Quadratic Direct Forecast (QDF) learning algorithm, which trains the forecast model using the adaptively updated quadratic-form weighting matrix. Experiments show that our QDF effectively improves the performance of various forecast models, achieving state-of-the-art results. Code is available at https://anonymous.4open.science/r/QDF-8937.
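A minimal sketch of a quadratic-form weighted objective, to contrast it with mean squared error. The function name `quadratic_form_loss` and the fixed weighting matrices are illustrative assumptions; the adaptive update of the weighting matrix that QDF performs is not shown here.

```python
def quadratic_form_loss(y_true, y_pred, weight):
    """Compute e^T W e for the error vector e = y_true - y_pred.

    `weight` is an H x H matrix (list of lists) over the H forecast steps.
    Diagonal entries weight each step; off-diagonal entries couple the
    errors of different steps.
    """
    err = [t - p for t, p in zip(y_true, y_pred)]
    horizon = len(err)
    return sum(
        err[i] * weight[i][j] * err[j]
        for i in range(horizon)
        for j in range(horizon)
    )


# With W = I / H the loss reduces to ordinary mean squared error.
H = 3
identity_over_h = [[(1.0 / H if i == j else 0.0) for j in range(H)] for i in range(H)]
mse_like = quadratic_form_loss([1.0, 2.0, 3.0], [0.0, 2.0, 5.0], identity_over_h)
```

A non-diagonal `weight` makes the penalty on one step's error depend on the errors at other steps, which is how correlation among future steps can enter the objective.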
Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems
Ori Shem-ur ⋅ Khen Cohen ⋅ Yaron Oz
Deep learning models, such as wide neural networks, can be viewed as nonlinear dynamical systems composed of numerous interacting degrees of freedom. When such systems approach the limit of an infinite number of degrees of freedom, their dynamics tend to simplify. This paper investigates gradient descent-based learning algorithms that exhibit linearization in their parameters. We establish that this apparent linearity arises from weak correlations between the first- and higher-order derivatives of the hypothesis function with respect to the parameters at initialization. Our findings indicate that these weak correlations fundamentally underpin the observed linearization phenomenon of wide neural networks. Leveraging this connection, we derive bounds on the deviation from linearity during stochastic gradient descent training. To support our analysis, we introduce a novel technique for characterizing the asymptotic behavior of random tensors. We validate our theoretical insights through empirical studies, comparing the linearized dynamics to the observed correlations.
Nesterov Finds GRAAL: Optimal and Adaptive Gradient Method for Convex Optimization
Ekaterina Borodich ⋅ Dmitry Kovalev
In this paper, we focus on the problem of minimizing a continuously differentiable convex objective function, $\min_x f(x)$. Recently, Malitsky (2020); Alacaoglu et al. (2023) developed an adaptive first-order method, GRAAL. This algorithm computes stepsizes by estimating the local curvature of the objective function without any line search procedures or hyperparameter tuning, and attains the standard iteration complexity $\mathcal{O}(L\Vert x_0-x^* \Vert^2/\epsilon)$ of fixed-stepsize gradient descent for $L$-smooth functions. However, a natural question arises: is it possible to accelerate the convergence of GRAAL to match the optimal complexity $\mathcal{O}(\sqrt{L\Vert x_0-x^*\Vert^2/\epsilon})$ of the accelerated gradient descent of Nesterov (1983)? Although some attempts have been made by Li and Lan (2025); Suh and Ma (2025), the ability of existing accelerated algorithms to adapt to the local curvature of the objective function is highly limited. We resolve this issue and develop GRAAL with Nesterov acceleration, which can adapt its stepsize to the local curvature at a geometric, or linear, rate just like non-accelerated GRAAL. We demonstrate the adaptive capabilities of our algorithm by proving that it achieves near-optimal iteration complexities for $L$-smooth functions, as well as under a more general $(L_0,L_1)$-smoothness assumption (Zhang et al., 2019).
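To make the gap between the two rates concrete, here is a worked comparison that ignores the constants hidden by the $\mathcal{O}$ notation. The numerical values of $L$, $\Vert x_0-x^*\Vert^2$, and $\epsilon$ are illustrative assumptions, not figures from the paper.

```latex
\[
L = 100,\quad \Vert x_0 - x^* \Vert^2 = 1,\quad \epsilon = 10^{-2}
\;\Longrightarrow\;
\frac{L\Vert x_0-x^*\Vert^2}{\epsilon} = 10^{4},
\qquad
\sqrt{\frac{L\Vert x_0-x^*\Vert^2}{\epsilon}} = 10^{2}.
\]
```

Under these assumed values, the non-accelerated rate corresponds to on the order of ten thousand iterations and the accelerated rate to on the order of one hundred.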
What Matters for Batch Online Reinforcement Learning in Robotics?
Perry Dong ⋅ Suvir Mirchandani ⋅ Dorsa Sadigh ⋅ Chelsea Finn
The ability to learn from large batches of autonomously collected data for policy improvement, a paradigm we refer to as batch online reinforcement learning, holds the promise of enabling truly scalable robot learning by significantly reducing the human effort required for data collection while benefiting from self-improvement. Yet, despite the promise of this paradigm, it remains challenging to achieve because existing algorithms cannot learn effectively from the autonomous data. For example, prior works have applied imitation learning and filtered imitation learning methods to the batch online RL problem, but these algorithms often fail to efficiently improve from the autonomously collected data or converge quickly to a suboptimal point. This raises the question of what matters for effective batch online reinforcement learning in robotics. Motivated by this question, we perform a systematic empirical study of three axes, (i) algorithm class, (ii) policy extraction methods, and (iii) policy expressivity, and analyze how these axes affect performance and scaling with the amount of autonomously collected data. Through our analysis, we make several observations. First, we observe that the use of Q-functions to guide batch online RL significantly improves performance over imitation-based methods. Building on this, we show that an implicit method of policy extraction (choosing the best action in the distribution of the policy) is necessary over traditional explicit policy extraction methods from offline RL. Next, we show that an expressive policy class is preferred over less expressive policy classes. Based on this analysis, we propose a general recipe for effective batch online RL. We then show that a simple addition to the recipe, namely using temporally-correlated noise to obtain more diversity, results in further performance gains. Our recipe obtains significantly better performance and scaling compared to prior methods.
Visual Jigsaw Post-Training Improves MLLMs
Penghao Wu ⋅ Yushan Zhang ⋅ Haiwen Diao ⋅ Bo Li ⋅ Lewei Lu ⋅ Ziwei Liu
Reinforcement learning based post-training has recently emerged as a powerful paradigm for enhancing the alignment and reasoning capabilities of multimodal large language models (MLLMs). While vision-centric post-training is crucial for enhancing MLLMs’ intrinsic understanding of visual signals, current post-training paradigms are predominantly text-centric, where dense visual inputs are only leveraged to extract sparse cues for text-based reasoning. A few approaches exist in this direction; however, they often still rely on text as an intermediate mediator or introduce additional visual generative designs. In this work, we introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in MLLMs. Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. This naturally aligns with reinforcement learning from verifiable rewards (RLVR), requires no additional visual generative components, and derives its supervisory signal automatically without any annotations. We instantiate Visual Jigsaw across three visual modalities, including images, videos, and 3D data. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding. Our findings highlight the potential of self-supervised vision-centric tasks in post-training MLLMs and aim to inspire further research on vision-centric pretext designs.
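The ordering task lends itself to a small sketch showing why the reward is verifiable without annotations: the ground-truth permutation is known by construction. The helper names, the comma-separated answer format, and the exact-match reward are illustrative assumptions; the paper's actual prompt format and reward shaping may differ.

```python
import random


def make_jigsaw(pieces, seed=0):
    """Shuffle `pieces` and return (shuffled, order).

    order[k] is the original index of the piece shown at position k,
    so `order` is the answer the model is expected to produce.
    """
    rng = random.Random(seed)
    order = list(range(len(pieces)))
    rng.shuffle(order)
    shuffled = [pieces[i] for i in order]
    return shuffled, order


def parse_permutation(text, n):
    """Parse e.g. '2, 0, 1' into [2, 0, 1]; return None if it is not a
    valid permutation of 0..n-1."""
    try:
        values = [int(tok) for tok in text.replace(",", " ").split()]
    except ValueError:
        return None
    return values if sorted(values) == list(range(n)) else None


def jigsaw_reward(answer_text, truth):
    """Exact-match reward: 1.0 for the correct permutation, else 0.0."""
    pred = parse_permutation(answer_text, len(truth))
    return 1.0 if pred is not None and pred == truth else 0.0
```

The pieces can be image patches, video clips, or 3D views; the code only needs them to be list elements, so the same check applies across modalities.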
RAPID$^3$: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer
Wangbo Zhao ⋅ Yizeng Han ⋅ Zhiwei Tang ⋅ Jiasheng Tang ⋅ Pengfei Zhou ⋅ Kai Wang ⋅ Bohan Zhuang ⋅ Zhangyang Wang ⋅ Fan Wang ⋅ Yang You
Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators (step reduction, feature caching, and sparse attention) enhance inference speed but typically rely on a uniform heuristic or manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID^3 (Tri-Level Reinforced Acceleration Policies for Diffusion Transformer), a framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads (Step-Skip, Cache-Reuse, and Sparse-Attention) observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model’s distribution. Across state-of-the-art DiT backbones including Stable Diffusion 3 and FLUX, RAPID^3 achieves nearly 3$\times$ faster sampling with competitive generation quality.
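For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage at its core: rewards for several samples of the same prompt are normalized against their own group, so no learned value function is needed. How the rewards are produced here (for example, from quality and speed-up of samples generated under different acceleration decisions) is an assumption, as are the function name and the use of the population standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards to zero mean and (roughly) unit scale.

    rewards -- rewards of several rollouts for the same prompt
    eps     -- small constant to avoid division by zero when all
               rewards in the group are identical
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Rollouts that beat their group average receive a positive advantage and are reinforced; a group with identical rewards yields all-zero advantages and therefore no policy-gradient signal.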
OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
Alana Renda ⋅ Jillian Ross ⋅ Jacob Andreas
Real-world settings where language models (LMs) are deployed, in domains spanning healthcare, finance, and other forms of knowledge work, require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on probabilistic estimation tasks that require models to synthesize knowledge from pretraining and express predictions as Bayesian priors. We assess these priors for accuracy and calibration. Across six frontier models, we find that LM-elicited priors are worth the equivalent of about five samples from the underlying data distribution, and that posteriors computed using LM priors tend to be more accurate than those computed using a naive prior. At the same time, the relationship between model accuracy and confidence is weak across the board, indicating the value of developing new methods to improve calibration. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
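One way to read "a prior worth about five samples" is through a conjugate update in which the prior acts like a number of pseudo-observations. The sketch below assumes a Gaussian model with known variance, where the prior variance equals the data variance divided by `prior_strength`; this modeling choice and the function name are illustrative assumptions, not the benchmark's actual scoring procedure.

```python
def posterior_mean(prior_mean, prior_strength, observations):
    """Posterior mean when the prior counts as `prior_strength` pseudo-samples.

    Normal-normal conjugate update with known variance: the posterior mean
    is a weighted average of the prior mean and the sample mean.
    """
    n = len(observations)
    if n == 0:
        return prior_mean
    sample_mean = sum(observations) / n
    return (prior_strength * prior_mean + n * sample_mean) / (prior_strength + n)


# A prior centred at 10 and worth 5 samples, combined with 5 observations
# of 12, lands halfway between the two.
halfway = posterior_mean(10.0, 5, [12.0] * 5)
```

With only a handful of real observations such a prior shifts the estimate noticeably; once hundreds of observations are available, its influence becomes negligible.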
Predicting Kernel Regression Learning Curves from Only Raw Data Statistics
Dhruva Karkada ⋅ Joseph Turnbull ⋅ Yuxi Liu ⋅ James Simon
We study kernel regression with common rotation-invariant kernels on real datasets including CIFAR-5m, SVHN, and ImageNet. We give a theoretical framework that predicts learning curves (test risk vs. sample size) from only two measurements: the empirical data covariance matrix and an empirical polynomial decomposition of the target function $f_*$. The key new idea is an analytical approximation of a kernel’s eigenvalues and eigenfunctions with respect to an anisotropic data distribution. The eigenfunctions resemble Hermite polynomials of the data, so we call this approximation the Hermite eigenstructure ansatz (HEA). We prove the HEA for Gaussian data, but we find that real image data is often "Gaussian enough" for the HEA to hold well in practice, enabling us to predict learning curves by applying prior results relating kernel eigenstructure to test risk. Extending beyond kernel regression, we empirically find that MLPs in the feature-learning regime learn Hermite polynomials in the order predicted by the HEA. Our HEA framework is a proof of concept that an end-to-end theory of learning which maps dataset structure all the way to model performance is possible for nontrivial learning algorithms on real datasets.
LumosX: Relate Any Identities with Their Attributes for Personalized Video Generation
Jiazheng Xing ⋅ Fei Du ⋅ Hangjie Yuan ⋅ Pengwei Liu ⋅ Hongbin Xu ⋅ Hai Ci ⋅ Ruigang Niu ⋅ Weihua Chen ⋅ Fan Wang ⋅ Yong Liu
Recent advances in diffusion models have significantly improved text-to-video generation, enabling personalized content creation with fine-grained control over both foreground and background elements. However, precise face–attribute alignment across subjects remains challenging, as existing methods lack explicit mechanisms to ensure intra-group consistency. Addressing this gap requires both explicit modeling strategies and face-attribute-aware data resources. We therefore propose LumosX, a framework that advances both data and model design. On the data side, a tailored collection pipeline orchestrates captions and visual cues from independent videos, while multimodal large language models (MLLMs) infer and assign subject-specific dependencies. These extracted relational priors impose a finer-grained structure that amplifies the expressive control of personalized video generation and enables the construction of a comprehensive benchmark. On the modeling side, Relational Self-Attention and Relational Cross-Attention intertwine position-aware embeddings with refined attention dynamics to inscribe explicit subject–attribute dependencies, enforcing disciplined intra-group cohesion and amplifying the separation between distinct subject clusters. Comprehensive evaluations on our benchmark demonstrate that LumosX achieves state-of-the-art performance in fine-grained, identity-consistent, and semantically aligned personalized multi-subject video generation.
SuperMAN: Interpretable and Expressive Networks over Temporally Sparse Heterogeneous Data
Andrea Zerio ⋅ Maya Bechler-Speicher ⋅ Maor Huri ⋅ Marie Vestergaard ⋅ Tine Jess ⋅ Ran Gilad-Bachrach ⋅ Samir Bhatt ⋅ Aleksejs Sazonovs
Real-world temporal data often consists of multiple signal types recorded at irregular, asynchronous intervals. For instance, in the medical domain, different types of blood tests can be measured at different times and frequencies, resulting in fragmented and unevenly scattered temporal data. Similar issues of irregular sampling occur in other domains, such as the monitoring of large systems using event log files. Effectively learning from such data requires handling sets of temporally sparse and heterogeneous signals. In this work, we propose Super Mixing Additive Networks (SuperMAN), a novel and interpretable-by-design framework for learning directly from such heterogeneous signals, by modeling them as sets of implicit graphs. SuperMAN provides diverse interpretability capabilities, including node-level, graph-level, and subset-level importance, and enables practitioners to trade finer-grained interpretability for greater expressivity when domain priors are available. SuperMAN achieves state-of-the-art performance in real-world high-stakes tasks, including predicting Crohn’s disease onset and hospital length of stay from routine blood test measurements and detecting fake news. Furthermore, we demonstrate how SuperMAN’s interpretability properties assist in revealing disease development phase transitions and provide crucial insights in the healthcare domain.
Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute
Kieran Didi ⋅ Zuobai Zhang ⋅ Guoqing Zhou ⋅ Danny Reidenbach ⋅ Zhonglin Cao ⋅ Sooyoung Cha ⋅ Tomas Geffner ⋅ Christian Dallago ⋅ Jian Tang ⋅ Michael Bronstein ⋅ Martin Steinegger ⋅ Emine Kucukbenli ⋅ Arash Vahdat ⋅ Karsten Kreis
Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors ("hallucination"). We argue that this is a false dichotomy and propose Proteina-Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architectures and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina-Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang ⋅ Shengran Hu ⋅ Cong Lu ⋅ Robert Lange ⋅ Jeff Clune
Most of today's AI systems are constrained by human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The scientific method, on the other hand, is a cumulative and open-ended system, where each innovation builds upon previous artifacts, enabling future discoveries. There is growing hope that the current manual process of advancing AI could itself be automated. If done safely, such automation would accelerate AI development and allow us to reap its benefits much sooner. This prospect raises the question of how AI systems can endlessly improve themselves while getting better at solving relevant problems. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a novel self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM grows an archive of generated coding agents. It samples agents from this archive, which self-modify to create new, interesting versions of themselves. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. 
Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). Overall, the DGM represents a significant step toward self-improving AI, capable of gathering its own stepping stones along a path that unfolds into endless innovation.
RMFlow: Refined Mean Flow by a Noise-Injection Step for Multimodal Generation
Yuhao Huang ⋅ Shih-Hsin Wang ⋅ Andrea Bertozzi ⋅ Bao Wang
Mean flow (MeanFlow) enables efficient, high-fidelity image generation, yet its single-function evaluation (1-NFE) generation often cannot yield compelling results. We address this issue by introducing RMFlow, an efficient multimodal generative model that integrates a coarse 1-NFE MeanFlow transport with a subsequent tailored noise-injection refinement step. RMFlow approximates the average velocity of the flow path using a neural network trained with a new loss function that balances minimizing the Wasserstein distance between probability paths and maximizing sample likelihood. RMFlow achieves near state-of-the-art results on text-to-image, context-to-molecule, and time-series generation using only 1-NFE, at a computational cost comparable to the baseline MeanFlows.
Larger Datasets Can Be Repeated More: A Theoretical Analysis of Multi-Epoch Scaling in Linear Regression
Tingkai Yan ⋅ Haodong Wen ⋅ Binghui Li ⋅ Kairong Luo ⋅ Wenguang Chen ⋅ Kaifeng Lyu
While data scaling laws of large language models (LLMs) have been widely examined in the one-pass regime with massive corpora, their form under limited data and repeated epochs remains largely unexplored. This paper presents a theoretical analysis of how a common workaround, training for multiple epochs on the same dataset, reshapes the data scaling laws in linear regression. Concretely, we ask: to match the performance of training on a dataset of size $N$ for $K$ epochs, how much larger must a dataset be if the model is trained for only one pass? We quantify this using the $\textit{effective reuse rate}$ of the data, $E(K, N)$, which we define as the multiplicative factor by which the dataset must grow under one-pass training to achieve the same test loss as $K$-epoch training. Our analysis precisely characterizes the scaling behavior of $E(K, N)$ for SGD in linear regression under either strong convexity or Zipf-distributed data: (1) When $K$ is small, we prove that $E(K, N) \approx K$, indicating that every new epoch yields a linear gain; (2) As $K$ increases, $E(K, N)$ plateaus at a problem-dependent value that grows with $N$ ($\Theta(\log N)$ for the strongly-convex case), implying that larger datasets can be repeated more times before the marginal benefit vanishes. These theoretical findings point out a neglected factor in a recent empirical study by [Muennighoff et al. (2023)](https://arxiv.org/abs/2305.16264), which claimed that training LLMs for up to $4$ epochs results in negligible loss differences compared to using fresh data at each step, $\textit{i.e.}$, $E(K, N) \approx K$ for $K \le 4$ in our notation. Supported by further empirical validation with LLMs, our results reveal that the maximum $K$ value for which $E(K, N) \approx K$ in fact depends on the data size and distribution, and underscore the need to explicitly model both factors in future studies of scaling laws with data reuse.
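The definition of the effective reuse rate given in the abstract can be written out explicitly. The following is a sketch under an assumed notation: $\mathcal{L}(K, N)$ denotes the test loss after $K$ epochs on $N$ samples, a symbol introduced here for illustration and not taken from the abstract.

```latex
% Assumed notation: \mathcal{L}(K, N) = test loss after K epochs on N samples.
\[
  E(K, N) \;=\; \frac{1}{N}\,
  \min\bigl\{\, N' : \mathcal{L}(1, N') \le \mathcal{L}(K, N) \,\bigr\},
  \qquad E(1, N) = 1 .
\]
% The two regimes stated in the abstract:
\[
  E(K, N) \approx K \quad \text{(small } K\text{)},
  \qquad
  E(K, N) = \Theta(\log N) \quad \text{(large } K\text{, strongly convex case)} .
\]
```

Read this way, $E(K, N) \approx K$ says that each repeated epoch is worth about as much as an equal amount of fresh data, while the plateau says that beyond some number of epochs further repetition adds nothing.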
Uni-NTFM: A Unified Foundation Model for EEG Signal Representation Learning
Zhisheng Chen ⋅ Yingwei Zhang ⋅ Qizhen Lan ⋅ Tianyu Liu ⋅ Huacan Wang ⋅ Yi Ding ⋅ Ziyu Jia ⋅ Ronghao Chen ⋅ Kun Wang ⋅ XINLIANG ZHOU
Current foundation models for electroencephalography (EEG) rely on architectures adapted from computer vision or natural language processing, typically treating neural signals as pixel grids or token sequences. This approach overlooks the fact that neural activity is driven by diverse sparse coding across a geometrically and topologically complex cortex. Inspired by biological neural mechanisms, we propose the Unified Neural Topological Foundation Model (Uni-NTFM), an architecture rooted in three core neuroscience principles. In detail, to align with the brain's decoupled coding mechanism, we design the Heterogeneous Feature Projection Module. This module simultaneously encodes both time-domain non-stationary transients and frequency-domain steady-state rhythms, ensuring high quality in both waveform morphology and spectral rhythms. Moreover, we introduce a Topological Embedding mechanism to inject structured spatial priors and align different sensor configurations onto a unified latent functional topography, effectively reconstructing the geometry of brain regions. Furthermore, we achieve the functional modularization and sparse coding efficiency of biological networks by constructing the Mixture-of-Experts Transformer network. This dynamic routing mechanism assigns different signal patterns and tasks to specialized neural subnetworks, effectively preventing task interference while increasing the model capacity to a record-breaking 1.9 billion parameters. Uni-NTFM is pre-trained on a diverse corpus comprising 28,000 hours of EEG data, and outperforms existing models across nine distinct downstream tasks under both linear probing and fine-tuning settings, demonstrating that aligning model architecture with neural mechanisms is important for learning universal representations and achieving generalizable brain decoding. Our code is available at https://anonymous.4open.science/r/Uni-NTFM-0924
Reinforcement Learning for Machine Learning Engineering Agents
Sherry Yang ⋅ Joy He-Yueya ⋅ Percy Liang
Machine learning engineering (MLE) has a clear objective: Given an MLE task and a verifier (e.g., performance on some held-out data), what is the most effective way to utilize compute to achieve the best performance for the given task? Existing language model (LM) agents rely on prompting frontier LMs and accumulating experience non-parametrically by storing and retrieving experience through agent scaffolds and test-time compute. In this paper, we show that in environments such as MLE where a good verifier is available, adapting the LM parameters through gradient updates can be more effective in utilizing compute and the agent’s experience. Specifically, we show that agents backed by weaker models that improve via reinforcement learning (RL) can eventually outperform agents backed by much larger but static models for a given MLE task. We identify two major challenges with RL in this setting. First, actions can take a variable amount of time (e.g., executing code for different solutions), which leads to asynchronous policy gradient updates that favor faster but suboptimal solutions. We propose duration-aware gradient updates in a distributed asynchronous RL framework to amplify high-cost but high-reward actions. Second, using performance on the held-out data as a reward for MLE provides limited feedback. A program that’s nearly correct is treated the same as one that fails entirely (e.g., during data loading). We propose environment instrumentation to offer verifiable partial credit, using a separate, static language model to insert print statements into an existing program. Our experiments suggest that a small LM (Qwen2.5-3B) adapted with RL, when given enough compute, can solve an MLE task better than prompting a frontier model (Claude-3.5-Sonnet) with the state-of-the-art agent scaffold (AIDE) by an average of 22% across 12 Kaggle tasks.
MuonBP: Faster Muon via Block-Periodic Orthogonalization
Ahmed Khaled ⋅ Kaan Ozkara ⋅ Tao Yu ⋅ Mingyi Hong ⋅ Youngsuk Park
Gradient orthogonalization is a simple strategy that shows great utility in speeding up gradient descent. The Muon optimizer (Keller et al., 2024) combines gradient orthogonalization with first-order momentum and achieves significant improvement in data efficiency over Adam/AdamW (Loshchilov & Hutter, 2019) for language model training. However, when using model parallelism, gradient orthogonalization introduces additional overhead compared to coordinate-wise optimizers (such as AdamW) due to additional gather and scatter operations on gradient matrix shards from different devices. This additional communication can amount to a throughput hit of 5%-10% compared to Adam/AdamW. To remedy this, we propose Muon with Block-Periodic Orthogonalization (MuonBP), which applies orthogonalization independently to matrix shards on each device and periodically performs full orthogonalization to maintain training stability at scale. We show how to adjust the learning rate from the baseline to MuonBP and give convergence guarantees for this algorithm. Crucially, our theory dictates that we use two stepsizes: one for the blockwise orthogonalization steps, and one for the full orthogonalization steps. Our method is simple, requires minimal hyperparameter adjustments, and achieves competitive iteration complexity compared with baseline Muon while providing per-iteration throughput comparable to coordinate-wise methods such as AdamW. When training an 8B model with eight-way tensor parallelism and ZeRO optimizer state sharding, MuonBP achieves an 8% throughput increase compared to Muon with no degradation in performance.
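The block-periodic schedule described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: momentum is omitted, orthogonalization uses an SVD as a stand-in for whatever routine Muon uses in practice, shards are assumed to be column blocks of the gradient matrix, and the names `num_shards`, `period`, `lr_full`, and `lr_block` (with their values) are hypothetical.

```python
import numpy as np


def orthogonalize(g):
    """Set every singular value of g to 1 (SVD-based stand-in, an assumption)."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def block_periodic_direction(grad, step, num_shards, period):
    """Return the update direction and which kind of step was taken."""
    if step % period == 0:
        # Periodic full orthogonalization of the whole gradient matrix.
        return orthogonalize(grad), "full"
    # Otherwise each shard (here: a column block) is orthogonalized on its own,
    # which would need no cross-device gather in a sharded setup.
    shards = np.array_split(grad, num_shards, axis=1)
    return np.concatenate([orthogonalize(s) for s in shards], axis=1), "block"


def muonbp_step(weight, grad, step, num_shards=4, period=5,
                lr_full=0.02, lr_block=0.01):
    """One momentum-free update using two step sizes (hypothetical values)."""
    direction, kind = block_periodic_direction(grad, step, num_shards, period)
    lr = lr_full if kind == "full" else lr_block
    return weight - lr * direction
```

The two learning rates mirror the abstract's statement that the theory calls for separate step sizes for blockwise and full orthogonalization steps; how they should be set relative to each other is not specified here.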
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Jiawei Gu ⋅ Yunzhuo Hao ⋅ Huichen Wang ⋅ Linjie Li ⋅ Michael Qizhe Shieh ⋅ Yejin Choi ⋅ Ranjay Krishna ⋅ Yu Cheng
Multimodal reasoning requires iterative coordination between language and vision, yet it remains unclear what constitutes a meaningful interleaved chain of thought. We posit that text and image thoughts should function as complementary, rather than isomorphic, modalities that mutually advance reasoning. Guided by this principle, we build ThinkMorph, a unified model fine-tuned on $\sim$24K high-quality interleaved reasoning traces spanning tasks with varying visual engagement. ThinkMorph learns to generate progressive text–image reasoning steps that concretely manipulate visual content while maintaining coherent verbal logic. It delivers large gains on vision-centric benchmarks (averaging 34.7% over the base model) and generalizes to out-of-domain tasks, matching or surpassing larger and proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal intelligence, including unseen visual manipulation skills, adaptive switching between reasoning modes, and better test-time scaling through diversified multimodal thoughts. These findings suggest promising directions for characterizing the emergent capabilities of unified models for multimodal reasoning.
CircuitNet 3.0: A Multi-Modal Dataset with Task-Oriented Augmentation for AI-Driven Circuit Design
Mingjun Wang ⋅ Yihan Wen ⋅ Yuntao Lu ⋅ Fengrui Liu ⋅ Yuxiang Zhao ⋅ Boyu Han ⋅ Jianan Mu ⋅ Yibo Lin ⋅ Runsheng Wang ⋅ Bei Yu ⋅ Huawei Li
Integrated circuit (IC) designs require transforming high-level specifications into physical layouts, demanding extensive expertise and specialized tools, as well as months of time and numerous iterations. While machine learning (ML) has shown promise in various research domains, the lack of large-scale, open datasets limits its application in chip design. To address this limitation, we introduce CircuitNet 3.0, a large-scale, comprehensive, and open-source dataset curated to facilitate the evaluation of ML models on challenging timing and power prediction tasks. Starting with a diverse set of 8,659 validated open-source designs, we employ a systematic framework to generate over 15,000 instances. Through specialized syntax-tree mutation strategies and a principled, task-oriented filtering methodology, we enrich each design with multi-modal information spanning multiple design stages, including complete design flow documentation, register-transfer-level (RTL) designs and corresponding netlists, detailed physical layouts, and comprehensive performance metrics. The experimental results demonstrate that ML models leveraging the enriched multi-stage, multi-modal circuit representations significantly improve performance over existing open-source datasets in electronic design automation (EDA) tasks, paving the way for efficient and accessible circuit representation learning. The dataset and code are available at https://github.com/sklp-eda-lab/iclr-circuitnet_3.0/.
Natural Language PDDL (NL-PDDL) for Open-world Goal-oriented Commonsense Regression Planning in Embodied AI
Xiaotian Liu ⋅ Armin Toroghi ⋅ Jiazhou Liang ⋅ David Courtis ⋅ Ruiwen Li ⋅ Ali Pesaranghader ⋅ Jaehong Kim ⋅ Tanmana Sadhu ⋅ Hyejeong Jeon ⋅ Scott Sanner
Planning in open-world environments, where agents must act with partially observed states and incomplete knowledge, is a central challenge in embodied AI. Open-world planning involves not only sequencing actions but also determining what information the agent needs to sense to enable those actions. Existing approaches using Large Language Models (LLM) and Vision-Language Models (VLM) cannot reliably plan over long horizons and complex goals, where they often hallucinate and fail to reason causally over agent-environment interactions. Alternatively, classical PDDL planners offer correct and principled reasoning, but fail in open-world settings: they presuppose complete models and depend on exhaustive grounding over all objects, states, and actions; they cannot address misalignment between goal specifications (e.g., “heat the bread”) and action specifications (e.g., “toast the bread”); and they do not generalize across modalities (e.g., text, vision). To address these core challenges: (i) we extend symbolic PDDL into a flexible natural language representation that we term NL-PDDL, improving accessibility for non-expert users as well as generalization over modalities; (ii) we generalize regression-style planning to NL-PDDL with commonsense entailment reasoning to determine what needs to be observed for goal achievement in partially-observed environments with potential goal–action specification misalignment; and (iii) we leverage the lifted specification of NL-PDDL to facilitate open-world planning that avoids exhaustive grounding and yields a time and space complexity independent of the number of ground objects, states, and actions. Our experiments in three diverse domains — classical Blocksworld and the embodied ALFWorld environment with both textual and visual states — show that NL-PDDL substantially outperforms existing baselines, is more robust to longer horizons and more complex goals, and generalizes across modalities.
MoSA: Mosaic Shared Adaptation of Large Language Models
Xiequn Wang ⋅ Zhan ZHUANG ⋅ Shengda Luo ⋅ Yu Zhang
We introduce MoSA, a new parameter-efficient fine-tuning (PEFT) method that replaces low-rank factorization with randomized, fine-grained sharing of weight updates. Each adapted weight matrix is constructed by broadcasting a small set of learned scalars over a fixed tessellation, a pre-defined group assignment of weight entries of the weight matrix, producing expressive changes under the same parameter budget as low-rank adaptation (LoRA). MoSA requires no architectural changes and can be merged into the base model for zero-overhead inference. Across diverse language understanding and generation tasks, MoSA matches or surpasses strong PEFT baselines under strictly matched budgets. Analyses and ablations indicate that non-local parameter sharing acts as an effective regularizer, and that grouping design and budget allocation govern the expressivity–efficiency trade-off. These results position MoSA as a simple, scalable alternative to LoRA. Our code is available at https://github.com/XiequnWang/MoSA-ICLR26.
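The idea of broadcasting a few learned scalars over a fixed tessellation can be sketched in a few lines of NumPy. This is an illustration only: the uniform random group assignment, the zero initialization, and the helper names `make_tessellation` and `mosaic_update` are assumptions made here, not details taken from the MoSA code.

```python
import numpy as np


def make_tessellation(shape, num_groups, seed=0):
    """Fixed random assignment of each weight entry to one of num_groups groups."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_groups, size=shape)


def mosaic_update(scalars, assignment):
    """Broadcast the learned scalars over the tessellation into a dense update."""
    return scalars[assignment]


d_out, d_in, rank = 64, 32, 4
# Match the trainable-parameter count of a rank-4 LoRA adapter on this matrix.
budget = rank * (d_out + d_in)
assignment = make_tessellation((d_out, d_in), budget)
# Zero initialization (an assumption) leaves the base model unchanged at the start.
scalars = np.zeros(budget)
base_weight = np.random.default_rng(1).normal(size=(d_out, d_in))
# The update has the same shape as the weight, so it merges with no inference overhead.
merged = base_weight + mosaic_update(scalars, assignment)
```

Because the update is a dense matrix of the same shape as the weight, it is not constrained to low rank, which is one way to read the abstract's claim of expressive changes under a LoRA-sized budget.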
Poisson Midpoint Method for Log Concave Sampling: Beyond the Strong Error Lower Bounds
Rishikesh Srinivasan ⋅ Dheeraj Nagaraj
We study the problem of sampling from strongly log-concave distributions over $\mathbb{R}^d$ using the Poisson midpoint discretization (a variant of the randomized midpoint method) for overdamped/underdamped Langevin dynamics. We prove its convergence in the 2-Wasserstein distance ($\mathcal W_2$), achieving a cubic speedup in dependence on the target accuracy ($\epsilon$) over the Euler-Maruyama discretization, surpassing existing bounds for randomized midpoint methods. Notably, in the case of underdamped Langevin dynamics, we demonstrate the complexity of $\mathcal W_2$ convergence is much smaller than the complexity lower bounds for convergence in $L^2$ strong error established in the literature.
A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
Shuo Xie ⋅ Tianhao Wang ⋅ Beining Wu ⋅ Zhiyuan Li
Adaptive optimizers can reduce to normalized steepest descent (NSD) when only adapting to the current gradient, suggesting a close connection between the two algorithmic families. A key distinction between their analyses, however, lies in the geometries, e.g., smoothness notions, they rely on. In the convex setting, adaptive optimizers are governed by a stronger adaptive smoothness condition, while NSD relies on the standard notion of smoothness. We extend the theory of adaptive smoothness to the nonconvex setting and show that it precisely characterizes the convergence of adaptive optimizers. Moreover, we establish that adaptive smoothness enables acceleration of adaptive optimizers with Nesterov momentum in the convex setting, a guarantee unattainable under standard smoothness for certain non-Euclidean geometry. We further develop an analogous comparison for stochastic optimization by introducing adaptive gradient variance, which parallels adaptive smoothness and leads to dimension-free convergence guarantees that cannot be achieved under standard gradient variance for certain non-Euclidean geometry.
Pyramid Patchification Flow for Visual Generation
Hui Li ⋅ Baoyou Chen ⋅ Jiaye Li ⋅ Jingdong Wang ⋅ Siyu Zhu
Diffusion Transformers (DiTs) typically use the same patch size for $\operatorname{Patchify}$ across timesteps, enforcing a constant token budget across timesteps. In this paper, we introduce Pyramidal Patchification Flow (PPFlow), which reduces the number of tokens for high-noise timesteps to improve the sampling efficiency. The idea is simple: use larger patches at higher-noise timesteps and smaller patches at lower-noise timesteps. The implementation is easy: share the DiT's transformer blocks across timesteps, and learn separate linear projections for different patch sizes in $\operatorname{Patchify}$ and $\operatorname{Unpatchify}$. Unlike Pyramidal Flow, which operates on pyramid representations, our approach operates over full latent representations, eliminating trajectory "jump points" and thus avoiding re-noising tricks for sampling. Training from pretrained SiT-XL/2 requires only $+8.9\%$ additional training FLOPs and delivers $2.02\times$ denoising speedups while preserving image generation quality; training from scratch achieves comparable sampling speedup, e.g., $2.04\times$ speedup in SiT-B. Trained from the text-to-image model FLUX.1, PPFlow achieves $1.61$-$1.86\times$ speedup from 512 to 2048 resolution with comparable quality.
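The effect of a timestep-dependent patch size on the token budget can be shown with a short sketch. The noise convention (`t` in [0, 1], with larger `t` meaning more noise), the two-level schedule, and the function names are hypothetical choices made for illustration; the abstract does not specify them.

```python
def patch_size_for_timestep(t, schedule=((0.5, 4), (0.0, 2))):
    """Pick a patch size from (min_noise_level, patch_size) pairs.

    Pairs are listed from highest to lowest noise threshold, so noisier
    timesteps get the larger patch. The thresholds are hypothetical.
    """
    for min_noise, patch_size in schedule:
        if t >= min_noise:
            return patch_size
    raise ValueError("t must be non-negative")


def num_tokens(latent_size, patch_size):
    """Number of tokens for a square latent split into square patches."""
    if latent_size % patch_size:
        raise ValueError("patch size must divide the latent size")
    return (latent_size // patch_size) ** 2


# Token budget along a sampling trajectory over a 32x32 latent.
budgets = [num_tokens(32, patch_size_for_timestep(t)) for t in (0.9, 0.6, 0.3, 0.0)]
```

With these assumed values, high-noise steps use 64 tokens and low-noise steps use 256, so the transformer blocks process a quarter of the tokens during the noisy part of the trajectory.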
Sparsity Forcing: Reinforcing Token Sparsity of MLLMs
Feng Chen ⋅ YEFEI HE ⋅ Lequan Lin ⋅ Jing Liu ⋅ Chenhui Gou ⋅ Bohan Zhuang ⋅ Qi Wu
Sparse attention mechanisms aim to reduce computational overhead with minimal accuracy loss by selectively processing salient tokens. Despite their effectiveness, most methods merely exploit a model’s inherent sparsity and thus plateau at moderate budgets (about 50% token reduction), with little headroom to push budgets lower without hurting accuracy. Other approaches attempt to enforce sparsity through trainable sparse attention or sharpness-inducing regularizers, but these either fix rigid patterns that ignore input and layer dynamics, or optimize proxy objectives without direct control over token budgets. In this paper, we explicitly reinforce token sparsity in well-posed multimodal large language models (MLLMs) through a simple RL-based post-training framework named $\textit{Sparsity Forcing}$. Our method explores the efficiency-accuracy trade-off by running multiple rollouts with different token budgets, where both efficiency (token reduction ratio) and performance (answer correctness) are formulated as joint rewards. By contrasting rollouts within each group, the more efficient and correct answer is rewarded while less efficient or incorrect ones are penalized, thereby turning token saving into an end-to-end, inference-consistent optimization objective. Across thirteen image and video benchmarks, Sparsity Forcing raises token reduction ratio on Qwen2-VL/Qwen2.5-VL from 20% to 75% with minimal accuracy decline, significantly reducing long-context inference memory by up to 3$\times$ while speeding up decoding by up to 3.3$\times$.
StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Anya Sims ⋅ Thomas Foster ⋅ T. Nguyen-Hien ⋅ Klara Kaleb ⋅ Joseph Lee ⋅ Jakob Foerster ⋅ Yee Whye Teh ⋅ Cong Lu
Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionately with seemingly simple subword-level tasks, like counting the number of 'r's in 'strawberry'. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper, we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to ‘see’ their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs’ downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok’s simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models.
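Random token splitting can be sketched on string tokens with a toy vocabulary. This is a minimal illustration, assuming that a token may only be split into two pieces that are themselves in the vocabulary and that a single pass is made over the sequence; the function name, the split probability `p`, and the vocabulary are hypothetical.

```python
import random


def stochastic_split(tokens, vocab, p=0.5, rng=None):
    """Randomly replace tokens by two in-vocabulary pieces that concatenate to them."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        # All ways to cut the token into two pieces that both exist in the vocabulary.
        splits = [
            (tok[:i], tok[i:])
            for i in range(1, len(tok))
            if tok[:i] in vocab and tok[i:] in vocab
        ]
        if splits and rng.random() < p:
            out.extend(rng.choice(splits))
        else:
            out.append(tok)
    return out


toy_vocab = {"strawberry", "straw", "berry", "st", "raw"}
always_split = stochastic_split(["strawberry"], toy_vocab, p=1.0, rng=random.Random(0))
never_split = stochastic_split(["strawberry"], toy_vocab, p=0.0, rng=random.Random(0))
```

The split sequence always decodes to the same text as the original, so the model sees alternative segmentations of identical strings rather than altered data.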
Skirting Additive Error Barriers for Private Turnstile Streams
Anders Aamand ⋅ Justin Chen ⋅ Sandeep Silwal
We study differentially private continual release of the number of distinct items in a turnstile stream, where items may be both inserted and deleted. A recent work of Jain, Kalemaj, Raskhodnikova, Sivakumar, and Smith (NeurIPS '23) shows that for streams of length $T$, polynomial additive error of $\Omega(T^{1/4})$ is necessary, even without any space restrictions. We show that this additive error lower bound can be circumvented if the algorithm is allowed to output estimates with both additive and multiplicative error. We give an algorithm for the continual release of the number of distinct elements with $\text{polylog} (T)$ multiplicative and $\text{polylog}(T)$ additive error. We also show a qualitatively similar phenomenon for estimating the $F_2$ moment of a turnstile stream, where we can obtain $1+o(1)$ multiplicative and $\text{polylog} (T)$ additive error. Both results can be achieved using polylogarithmic space whereas prior approaches use polynomial space. In the sublinear space regime, some multiplicative error is necessary even if privacy is not a consideration. We raise several open questions aimed at better understanding trade-offs between multiplicative and additive error in private continual release.
Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
Wang Zhu ⋅ Tian-qi Chen ⋅ Xinyan Yu ⋅ Ching Lin ⋅ Jade Law ⋅ Mazen Jizzini ⋅ Jorge Nieva ⋅ Ruishan Liu ⋅ Robin Jia
Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM, including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet, corrects these false presuppositions more than $43\%$ of the time. To study mitigation strategies, we further construct a 150-question Cancer-Myth-NFP set, in which physicians confirm the absence of false presuppositions. We find typical mitigation strategies, such as adding precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth to $80\%$, but at the cost of misidentifying presuppositions in $41\%$ of Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on other medical benchmarks. These findings highlight a critical gap in the reliability of LLMs, show that prompting alone is not a reliable remedy for false presuppositions, and underscore the need for more robust safeguards in medical AI systems.
Persistence Spheres: Bi-Continuous Representations of Persistence Diagrams
Matteo Pegoraro
Persistence spheres are a new functional representation of persistence diagrams. In contrast to existing embeddings such as persistence images, landscapes, or kernel-based methods, persistence spheres define a bi-continuous mapping: they are Lipschitz continuous with respect to the 1-Wasserstein distance and admit a continuous inverse on their image. This provides both stability and geometric fidelity, placing persistence spheres among the few representations of persistence diagrams that offer an inverse-continuity guarantee. We derive explicit formulas for persistence spheres and show that they can be computed efficiently with minimal parallelization overhead. Empirically, we evaluate them on clustering, regression, and classification tasks involving functional data, time series, graphs, meshes, and point clouds. Across these benchmarks, persistence spheres are competitive with, and often improve upon, standard baselines including persistence images, persistence landscapes, persistence splines, and the sliced Wasserstein kernel. Additional simulations in the appendices further support the method and provide practical guidance for tuning its parameters.
Nasty Adversarial Training: A Probability Sparsity Perspective for Robustness Enhancement
Yuhang Zhou ⋅ Zhongyun Hua ⋅ Zhaoquan Gu ⋅ Keke Tang ⋅ Rushi Lan ⋅ Yushu Zhang ⋅ Qing Liao ⋅ Leo Zhang
The vulnerability of deep neural networks to adversarial examples poses significant challenges to their reliable deployment. Among existing empirical defenses, adversarial training and robust distillation have proven the most effective. In this paper, we identify a property originally associated with model intellectual property, i.e., probability sparsity induced by nasty training, and demonstrate that it can also provide interpretable improvements to adversarial robustness. We begin by analyzing how nasty training induces sparse probability distributions and qualitatively explore the spatial metric preferences this sparsity introduces to the model. Building on these insights, we propose a simple yet effective adversarial training method, nasty adversarial training (NAT), which incorporates probability sparsity as a regularization mechanism to boost adversarial robustness. Both theoretical analysis and experimental results validate the effectiveness of NAT, highlighting its potential to enhance the adversarial robustness of deep neural networks in an interpretable manner.
On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
Jianing Guo ⋅ Zhenhong Wu ⋅ Chang Tu ⋅ Yiyao Ma ⋅ Xiangqi Kong ⋅ Zhiqian Liu ⋅ Jiaming Ji ⋅ Shuning Zhang ⋅ Yuanpei Chen ⋅ Kai Chen ⋅ Qi Dou ⋅ Yaodong Yang ⋅ Xianglong Liu ⋅ Huijie Zhao ⋅ Weifeng Lv ⋅ Simin Li
In Vision–Language–Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) $\pi_0$ demonstrates superior robustness. To build multi-modal robust VLAs, we propose RobustVLA, which defends against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate that RobustVLA delivers absolute gains over baselines of 12.6\% on the $\pi_0$ backbone and 10.4\% on the OpenVLA backbone across all 17 perturbations, achieves 50.6x faster inference than the existing visual-robust BYOVLA, which requires external LLMs, and yields a 10.4\% gain under mixed perturbations. On the real-world FR5 robot, under four types of multimodal perturbations, RobustVLA shows strong low-data performance, outperforming $\pi_0$ by $65.6\%$ in success rate with 25 demonstrations. Even with abundant demos, our method still outperforms $\pi_0$ by 30\% in success rate. Code and demo videos are available at https://github.com/gakakulicc/RobustVLA.
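The bandit step described above can be illustrated with a minimal UCB1 sketch. This is not the authors' implementation: the arms stand for perturbation types, the reward is a hypothetical scalar "harm" score (for example, a loss increase), and `toy_harm`, its values, and the exploration constant `c` are assumptions made purely for illustration.

```python
import math
import random


def ucb_select(counts, mean_harm, t, c=2.0):
    """Pick the arm (perturbation type) with the highest upper confidence bound."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # pull every arm once before using the bound
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(mean_harm, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(harm_fn, num_arms, steps, seed=0):
    """Run UCB1 for `steps` rounds; return pull counts and mean observed harm."""
    rng = random.Random(seed)
    counts = [0] * num_arms
    mean_harm = [0.0] * num_arms
    for t in range(1, steps + 1):
        arm = ucb_select(counts, mean_harm, t)
        harm = harm_fn(arm, rng)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm.
        mean_harm[arm] += (harm - mean_harm[arm]) / counts[arm]
    return counts, mean_harm


def toy_harm(arm, rng):
    """Hypothetical harm signal: arm 2 is the most damaging perturbation."""
    true_harm = [0.1, 0.3, 0.8, 0.2]
    return true_harm[arm] + rng.uniform(-0.05, 0.05)


counts, mean_harm = run_bandit(toy_harm, num_arms=4, steps=500)
most_harmful = max(range(4), key=lambda a: counts[a])
print(most_harmful, counts)
```

In this toy setting the sampler concentrates its pulls on the arm with the largest average harm, which is the behavior the abstract attributes to the upper confidence bound step.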
DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle
Yuheng Tang ⋅ Kaijie Zhu ⋅ Bonan Ruan ⋅ Chuqi Zhang ⋅ Michael Yang ⋅ Hongwei Li ⋅ Suyue Guo ⋅ Tianneng Shi ⋅ Zekun Li ⋅ Christopher Kruegel ⋅ Giovanni Vigna ⋅ Dawn Song ⋅ William Wang ⋅ Lun Wang ⋅ Yangruibo Ding ⋅ Zhenkai Liang ⋅ Wenbo Guo
Although AI agents have demonstrated extraordinary capabilities in code generation and software issue resolving, their capabilities across the full software DevOps cycle are still unknown. Unlike pure code generation, handling the DevOps cycle in real-world software, including developing, deploying, and managing, requires analyzing large-scale projects, understanding dynamic program behaviors, leveraging domain-specific tools, and making sequential decisions. However, existing benchmarks focus on isolated problems and lack environments and tool interfaces for DevOps. We introduce DevOps-Gym, the first end-to-end benchmark for evaluating AI agents across core DevOps workflows: build and configuration, monitoring, issue resolving, and test generation. DevOps-Gym includes 700+ real-world tasks collected from 30+ projects in Java and Go. We develop a semi-automated data collection mechanism with rigorous, non-trivial expert effort to ensure task coverage and quality. Our evaluation of state-of-the-art models and agents reveals fundamental limitations: they struggle with issue resolving and test generation in Java and Go, and remain unable to handle new tasks such as monitoring and build and configuration. These results highlight the need for essential research in automating the full DevOps cycle with AI agents.
SplitLoRA: Balancing Stability and Plasticity in Continual Learning Through Gradient Space Splitting
Haomiao Qiu ⋅ Miao Zhang ⋅ Ziyue Qiao ⋅ Weili Guan ⋅ Min Zhang ⋅ Liqiang Nie
Continual Learning (CL) requires a model to learn multiple tasks in sequence while maintaining both stability—preserving knowledge from previously learned tasks, and plasticity—effectively learning new tasks. Orthogonal projection has emerged as an effective and popular paradigm in CL, where it partitions the gradient space of previously learned tasks into two orthogonal subspaces: a primary subspace and a minor subspace. New tasks are learned effectively within the minor subspace, thereby reducing interference with previously acquired knowledge. However, existing orthogonal projection methods struggle to achieve an optimal balance between plasticity and stability, as it is hard to appropriately partition the gradient space. In this work, we consider a continual learning paradigm based on Low-Rank Adaptation (LoRA), which has gained considerable attention due to its efficiency and wide applicability, and propose a novel approach for continual learning, called SplitLoRA. We first provide a theoretical analysis of how subspace partitioning affects model stability and plasticity. Informed by this analysis, we then introduce an effective method that derives the optimal partition of the gradient space for previously learned tasks. This approach effectively balances stability and plasticity in continual learning. Experimental results on multiple datasets demonstrate that the proposed method achieves state-of-the-art performance. The code is available at https://github.com/qhmiao/SplitLoRA.
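The generic orthogonal-projection step that this line of work builds on can be sketched as follows. This shows only the projection itself, assuming an orthonormal basis for the primary subspace is already given; how SplitLoRA chooses the partition is the paper's contribution and is not reproduced here. The function and variable names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_to_minor(grad, primary_basis):
    """Remove from `grad` its components along each orthonormal primary direction.

    What remains lies in the orthogonal complement (the minor subspace), so an
    update along it does not move the model along the primary directions.
    """
    out = list(grad)
    for b in primary_basis:
        coeff = dot(grad, b)  # valid because the basis is assumed orthonormal
        out = [o - coeff * bi for o, bi in zip(out, b)]
    return out


# Toy example: the primary subspace is spanned by the first two coordinate axes.
primary_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
grad = [0.5, -2.0, 3.0]
projected = project_to_minor(grad, primary_basis)
print(projected)
```

A larger primary subspace protects more of the old tasks (stability) but leaves fewer directions for the new task (plasticity), which is the trade-off the abstract refers to.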
MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Zijian Zhou ⋅ Ao Qu ⋅ Zhaoxuan Wu ⋅ Sunghwan Kim ⋅ Alok Prakash ⋅ Daniela Rus ⋅ Bryan Kian Hsiang Low ⋅ Paul Liang
Modern language agents often need to solve long-horizon tasks requiring multiple turns of interaction with the environment, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths because the LLM forgets the context. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with near-constant context size when solving long-horizon tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. Leveraging reinforcement learning (RL) and rollout trajectory truncation, we train a MEM1 agent to develop internal states that integrate prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5$\times$ while reducing memory usage by 3.7$\times$ compared to Qwen2.5-14B-Instruct on an augmented multi-hop QA dataset with 16 objectives in each task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon task-solving agents that involve multiple interactions, where both efficiency and performance are optimized. Code can be found at https://github.com/MIT-MI/MEM1.
To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
Zeyu Yang ⋅ Tianyi Zhang ⋅ Jianwen Xie ⋅ Chuan Li ⋅ Zhaozhuo Xu ⋅ Anshumali Shrivastava
The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision \emph{floating-point} formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an \emph{exponent concentration} phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9\% memory savings and 177.1\% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era. Code is available at https://github.com/zeyuyang8/ecf8.
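To make the notion of exponent entropy concrete, the following sketch measures the Shannon entropy of the exponent field over a set of weights. It rests on two loud assumptions: it uses the 8-bit exponent of IEEE 754 float32 rather than an FP8 format, and it draws synthetic Gaussian values instead of trained model weights. It is not the ECF8 encoder.

```python
import math
import random
import struct
from collections import Counter


def float32_exponent(x):
    """Return the raw 8-bit biased exponent field of x stored as IEEE 754 float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


def exponent_entropy(weights):
    """Shannon entropy (in bits) of the exponent field over all weights."""
    counts = Counter(float32_exponent(w) for w in weights)
    total = len(weights)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Synthetic stand-in for a weight tensor (assumption: Gaussian, std 0.02).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10000)]
h = exponent_entropy(weights)
print(f"exponent entropy: {h:.2f} bits out of 8")
```

When the measured entropy is well below the width of the exponent field, an entropy coder can in principle store the exponents in fewer bits without losing information, which is the intuition behind lossless compression of this kind.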
Cross-Modal Redundancy and the Geometry of Vision–Language Embeddings
Grégoire DHIMOÏLA ⋅ Thomas Fel ⋅ Victor Boutin ⋅ Agustin Picard
Vision–language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.
AttTok: Marrying Attribute Tokens with Generative Pre-trained Vision-Language Models towards Medical Image Understanding
Hualiang Wang ⋅ Xinyue Xu ⋅ Lehan Wang ⋅ Bin Pu ⋅ Xiaomeng Li
Recent generative pre-trained vision–language (GPTv) models have achieved remarkable success in multi-modal understanding, inspiring their adaptation to medical imaging tasks such as disease diagnosis and visual question answering (VQA). However, current instruction-tuned GPTv models suffer from two key challenges: (1) medical attributes (e.g., disease names, severity grades) are encoded as plain text tokens, collapsing semantically distinct concepts into nearly identical textual sequences; and (2) inadequate textual supervision weakens visual representation learning, leading to severe inter-attribute confusion and misaligned vision–language embeddings. To address these limitations, we introduce attribute tokens (AttTok), a set of pre‑defined special tokens that uniquely encode clinical attributes (e.g., imaging modality, diagnosis, severity) within a structured token space. Complemented by attribute‑centric embedding books, AttTok serves as anchor points for aligning both visual and textual modalities into a shared, discriminative representation space. Building on this foundation, we design two key components: an attribute‑centric cross attention (ACC) adapter, which breaks the vision‑to‑text information‑flow bottleneck and enriches the visual encoder with discriminative attribute knowledge, and an attribute‑centric matching (ACM) loss, which enforces robust multi‑modal alignment centered on the attribute tokens. Extensive experiments on five medical classification benchmarks and three VQA datasets demonstrate that AttTok substantially improves both discriminative accuracy and medical knowledge reasoning, establishing a new paradigm for medical GPTv models with clinically discriminative understanding.
Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
Vaishnavi Shrivastava ⋅ Ahmed H Awadallah ⋅ Vidhisha Balachandran ⋅ Shivam Garg ⋅ Harkirat Behl ⋅ Dimitris Papailiopoulos
Large language models trained with reinforcement learning on verifiable rewards often inflate response length—trading brevity for accuracy. While longer reasoning can help on hard problems, many extra tokens are filler: verbose text making little progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem and only training on responses filtered by (1) length and (2) token efficiency (reward per token). By sampling more during training time, GFPO teaches models to think less at inference time. On Phi-4-reasoning, GFPO cuts GRPO’s length inflation by up to 85\% across STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while preserving accuracy. We find that GFPO also outperforms Dr. GRPO in both accuracy and length reduction and generalizes across model sizes and families. We further propose Adaptive Difficulty GFPO, which allocates more training exploration to harder problems, yielding better efficiency-accuracy trade-offs on challenging questions. With only a 7\% increase in training time, GFPO reduces end-to-end latency by $\sim$30\%, cutting response time on hard queries by 90 seconds. GFPO trades modest training-time increases for lasting gains in inference—an effective recipe for efficient reasoning.
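The filtering step can be sketched as below: sample a group of responses, then retain only a subset ranked by length or by reward per token. The data layout, the `filter_group` name, and the choice to keep exactly `k` responses are assumptions for illustration; how the retained subset enters the policy-gradient update is not specified in the abstract and is not shown.

```python
def filter_group(responses, k, metric="length"):
    """Keep k responses from a sampled group.

    responses: list of dicts with 'tokens' (int) and 'reward' (float).
    metric: 'length' keeps the shortest responses;
            'efficiency' keeps those with the highest reward per token.
    """
    if metric == "length":
        key = lambda r: r["tokens"]
    elif metric == "efficiency":
        key = lambda r: -r["reward"] / r["tokens"]
    else:
        raise ValueError(f"unknown metric: {metric}")
    return sorted(responses, key=key)[:k]


# Hypothetical group of four sampled responses for one problem.
group = [
    {"id": "a", "tokens": 1200, "reward": 1.0},
    {"id": "b", "tokens": 400, "reward": 1.0},
    {"id": "c", "tokens": 300, "reward": 0.0},
    {"id": "d", "tokens": 900, "reward": 1.0},
]

shortest = [r["id"] for r in filter_group(group, k=2, metric="length")]
efficient = [r["id"] for r in filter_group(group, k=2, metric="efficiency")]
print(shortest, efficient)
```

Note how the two metrics differ on this toy group: pure length filtering retains the short but incorrect response `c`, while token efficiency prefers short responses that also earn reward.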
SketchEvo: Leveraging Drawing Dynamics for Enhanced Image Synthesis
Zhixin Feng ⋅ Runan Yin ⋅ LAN YANG ⋅ Kaiyue Pang ⋅ Ke Li ⋅ Honggang Zhang ⋅ Yi-Zhe Song
Sketching represents humanity's most intuitive form of visual expression -- a universal language that transcends barriers. Although recent diffusion models integrate sketches with text, they often regard the complete sketch merely as a static visual constraint, neglecting the human preference information inherently conveyed during the dynamic sketching process. This oversight leads to images that, despite technical adherence to sketches, fail to align with human aesthetic expectations. Our framework, SketchEvo, harnesses the dynamic evolution of sketches by capturing the progression from initial strokes to completed drawing. Current preference alignment techniques struggle with sketch-guided generation because the dual constraints of text and sketch create insufficiently different latent samples when using noise perturbations alone. SketchEvo addresses this through two complementary innovations: first, by leveraging sketches at different completion stages to create meaningfully divergent samples for effective aesthetic learning during training; second, through a sequence-guided rollback mechanism that applies these learned preferences during inference by balancing textual semantics with structural guidance. Extensive experiments demonstrate that these complementary approaches enable SketchEvo to deliver improved aesthetic quality while maintaining sketch fidelity, successfully generalizing to incomplete and abstract sketches throughout the drawing process.
NLI: Non-uniform Linear Interpolation Approximation of Nonlinear Operations for Efficient LLMs Inference
Jiangyong Yu ⋅ Xiaomeng Han ⋅ Xing Hu ⋅ XUCHEN ⋅ Zhe jiang ⋅ Dawei Yang
Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks, but their deployment is often constrained by substantial memory footprints and computational costs. While prior work has achieved significant progress in compressing and accelerating linear layers, nonlinear layers—such as SiLU, RMSNorm, and Softmax—still heavily depend on high-precision floating-point operations. In this paper, we propose a calibration-free, dynamic-programming-optimal, and hardware-friendly framework called \underline{N}on-uniform \underline{L}inear \underline{I}nterpolation (NLI). NLI is capable of efficiently approximating a variety of nonlinear functions, enabling seamless integration into LLMs and other deep neural networks with almost no loss in accuracy. NLI ingeniously recasts cutpoint selection as a dynamic-programming problem, achieving the \emph{globally} minimal interpolation error in $\mathcal{O}(M \times N^2)$ time via Bellman’s optimality principle. Based on the NLI algorithm, we also design and implement a plug-and-play universal nonlinear computation unit. Hardware experiments demonstrate that the NLI Engine achieves more than 4× improvement in computational efficiency compared to the state-of-the-art designs.
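The idea of choosing interpolation cutpoints by dynamic programming can be sketched as follows. This is a generic formulation under stated assumptions, not the paper's algorithm: cutpoints are restricted to a candidate grid of N points, the error is the sum of squared deviations at grid points, and the segment-error table is precomputed naively. The recurrence itself runs in O(M N^2) time for M segments.

```python
import math


def segment_error(xs, ys, i, j):
    """Squared error of the chord from point i to point j at interior grid points."""
    err = 0.0
    for k in range(i + 1, j):
        t = (xs[k] - xs[i]) / (xs[j] - xs[i])
        approx = ys[i] + t * (ys[j] - ys[i])
        err += (approx - ys[k]) ** 2
    return err


def choose_cutpoints(xs, ys, num_segments):
    """Pick num_segments + 1 cutpoints (endpoints included) minimizing total error."""
    n = len(xs)
    if not 1 <= num_segments <= n - 1:
        raise ValueError("need 1 <= num_segments <= len(xs) - 1")
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            cost[i][j] = segment_error(xs, ys, i, j)
    # best[m][j]: minimal error covering xs[0..j] with m segments ending at j.
    best = [[math.inf] * n for _ in range(num_segments + 1)]
    prev = [[-1] * n for _ in range(num_segments + 1)]
    best[0][0] = 0.0
    for m in range(1, num_segments + 1):
        for j in range(1, n):
            for i in range(j):
                cand = best[m - 1][i] + cost[i][j]
                if cand < best[m][j]:
                    best[m][j] = cand
                    prev[m][j] = i
    # Backtrack from the last grid point to recover the chosen cutpoints.
    idx = [n - 1]
    j = n - 1
    for m in range(num_segments, 0, -1):
        j = prev[m][j]
        idx.append(j)
    idx.reverse()
    return [xs[c] for c in idx], best[num_segments][n - 1]


# Toy target with a single kink at 0; a smooth function such as SiLU
# could be substituted for ys on a finer grid.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [max(0.0, x) for x in xs]
cuts, err = choose_cutpoints(xs, ys, num_segments=2)
print(cuts, err)
```

With two segments the recurrence places the interior cutpoint at the kink, so the toy function is reproduced exactly on the grid. A uniform split would only match this by coincidence when the kink happens to fall on a uniform cutpoint.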
On Predictability of Reinforcement Learning Dynamics for Large Language Models
Cai Yuchen ⋅ Ding Cao ⋅ Xin Xu ⋅ Zijun Yao ⋅ Yuqing Huang ⋅ Benyi Zhang ⋅ Zhenyu Tan ⋅ Guiquan Liu ⋅ Junfeng Fang
Recent advances in the reasoning capabilities of large language models (LLMs) are largely driven by reinforcement learning (RL), yet the underlying parameter dynamics during RL training remain poorly understood. This work identifies two fundamental properties of RL-induced parameter updates in LLMs: (1) Rank-1 Dominance, where the top singular subspace of the parameter update matrix nearly fully determines reasoning improvements, recovering over 99\% of performance gains; and (2) Rank-1 Linear Dynamics, where this dominant subspace evolves linearly throughout training, enabling accurate prediction from early checkpoints. Extensive experiments across 13 LLMs and 10 algorithms validate the generalizability of these properties. More importantly, based on these findings, we propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window, achieving up to 2.5× speedup while retaining > 96\% of reasoning performance without extra modules or hyperparameter tuning. This positions our finding as a versatile and practical tool for large-scale RL, opening a path toward a principled, interpretable, and efficient training paradigm for LLMs.
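As an illustration of what a rank-1 view of a parameter update means, the sketch below extracts the top singular triplet of a small update matrix by power iteration and reports how much of the update's squared Frobenius norm it captures. The matrix values are made up, the helper names are hypothetical, and the measure used here is norm energy rather than the downstream reasoning performance reported in the abstract. The extrapolation step of AlphaRL is not shown.

```python
import math


def matvec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def rank1_approx(delta, iters=200):
    """Top singular value and vectors of delta via power iteration on delta^T delta."""
    at = transpose(delta)
    v = [1.0] * len(delta[0])
    for _ in range(iters):
        w = matvec(at, matvec(delta, v))
        n = norm(w)
        v = [x / n for x in w]
    av = matvec(delta, v)
    sigma = norm(av)
    u = [x / sigma for x in av]
    return sigma, u, v


# Hypothetical update matrix (W_after_RL - W_base), nearly rank-1 by construction.
delta = [[3.0, 4.0], [6.0, 8.1]]
sigma, u, v = rank1_approx(delta)
frob_sq = sum(x * x for row in delta for x in row)
energy_fraction = sigma ** 2 / frob_sq
rank1 = [[sigma * ui * vj for vj in v] for ui in u]
print(f"top singular value {sigma:.4f}, energy fraction {energy_fraction:.6f}")
```

For real weight matrices a library SVD routine would be the practical choice. Power iteration is used here only to keep the sketch dependency-free.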
Dichotomous Diffusion Policy Optimization
Ruiming Liang ⋅ Yinan Zheng ⋅ Kexin ZHENG ⋅ Tianyi Tan ⋅ Jianxiong Li ⋅ Liyuan Mao ⋅ Zhihao Wang ⋅ Guang Chen ⋅ Hangjun Ye ⋅ Jingjing Liu ⋅ Jinqiao Wang ⋅ Xianyuan Zhan
Diffusion-based policies have gained growing popularity in solving a wide range of decision-making tasks due to their superior expressiveness and controllable generation during inference. However, effectively training large diffusion policies using reinforcement learning (RL) remains challenging. Existing methods either suffer from unstable training due to directly maximizing value objectives, or face computational issues due to relying on a crude Gaussian likelihood approximation, which requires a large number of sufficiently small denoising steps. In this work, we propose DIPOLE (Dichotomous diffusion Policy improvement), a novel RL algorithm designed for stable and controllable diffusion policy optimization. We begin by revisiting the KL-regularized objective in RL, which offers a desirable weighted regression objective for diffusion policy extraction, but often struggles to balance greediness and stability. We then formulate a greedified policy regularization scheme, which naturally enables decomposing the optimal policy into a pair of stably learned dichotomous policies: one aims at reward maximization, and the other focuses on reward minimization. Under such a design, optimized actions can be generated by linearly combining the scores of dichotomous policies during inference, thereby enabling flexible control over the level of greediness. Evaluations in offline and offline-to-online RL settings on ExORL and OGBench demonstrate the effectiveness of our approach. We also use DIPOLE to train a large vision-language-action (VLA) model for end-to-end autonomous driving (AD) and evaluate it on the large-scale real-world AD benchmark NAVSIM, highlighting its potential for complex real-world applications.
OPRIDE: Efficient Offline Preference-based Reinforcement Learning via In-Dataset Exploration
Yiqin Yang ⋅ Hao Hu ⋅ Yihuan Mao ⋅ Jin Zhang ⋅ Chengjie Wu ⋅ Yuhua Jiang ⋅ Xu Yang ⋅ Runpeng Xie ⋅ Yi Fan ⋅ Bo Liu ⋅ Yang Gao ⋅ Bo XU ⋅ Chongjie Zhang
Preference-based reinforcement learning (PbRL) can help avoid sophisticated reward designs and align better with human intentions, showing great promise in various real-world applications. However, obtaining human feedback for preferences can be expensive and time-consuming, which forms a strong barrier for PbRL. In this work, we address the problem of low query efficiency in offline PbRL, pinpointing two primary reasons: inefficient exploration and overoptimization of learned reward functions. In response to these challenges, we propose a novel algorithm, Offline PbRL via In-Dataset Exploration (OPRIDE), designed to enhance the query efficiency of offline PbRL. OPRIDE consists of two key features: a principled exploration strategy that maximizes the informativeness of the queries and a discount scheduling mechanism aimed at mitigating overoptimization of the learned reward functions. Through empirical evaluations, we demonstrate that OPRIDE significantly outperforms prior methods, achieving strong performance with notably fewer queries. Moreover, we provide theoretical guarantees of the algorithm's efficiency. Experimental results across various locomotion, manipulation, and navigation tasks underscore the efficacy and versatility of our approach.
Distilled Pretraining: A modern lens of Data, In-Context Learning and Test-Time Scaling
Sachin Goyal ⋅ David Lopez-Paz ⋅ Kartik Ahuja
In the past year, distillation has seen a renewed prominence in large language model (LLM) pretraining, exemplified by the Llama-3.2 and Gemma model families. While distillation has historically been shown to improve statistical modeling, its effects on new paradigms key to modern LLMs—such as test-time scaling and in-context learning—remain underexplored. In this work, we make three main contributions. First, we show that pretraining with distillation yields models that exhibit remarkably better test-time scaling. Second, we observe that this benefit comes with a trade-off: distillation impairs in-context learning capabilities, particularly the one modeled via induction heads. Third, to demystify these findings, we study distilled pretraining in a sandbox of a bigram model, which helps us isolate the common principal factor behind our observations. Finally, using these insights, we shed light on various design choices for pretraining that should help practitioners going forward.
Consis-GCPO: Consistency-Preserving Group Causal Preference Optimization for Vision Customization
Qiaoqiao Jin ⋅ Dong She ⋅ Siming Fu ⋅ Hualiang Wang ⋅ weihao zhou ⋅ Jidong Jiang
Subject-driven generation faces a fundamental challenge: achieving high subject fidelity while maintaining semantic alignment with textual descriptions. While recent GRPO-based approaches have shown promise in aligning generative models with human preferences, they apply uniform optimization across all denoising timesteps, ignoring the temporal dynamics of how textual and visual conditions influence generation. We present Consis-GCPO, a causal reinforcement learning framework that reformulates multi-modal condition generation through discrete-time causal modeling. Our key insight is that different conditioning signals exert varying influence throughout the denoising process—text guides semantic structure in early steps while visual references anchor details in later stages. By introducing decoupled causal intervention trajectories, we quantify instantaneous causal effects at each timestep, transforming these measurements into temporally-weighted advantages for targeted optimization. This approach enables precise tracking of textual and visual contributions, ensuring accurate credit assignment for each conditioning modality. Extensive experiments demonstrate that Consis-GCPO significantly advances personalized generation, achieving superior subject consistency while preserving strong text-following capabilities, particularly excelling in complex multi-subject scenarios.
Obscure but Effective: Classical Chinese Jailbreak Prompt Optimization via Bio-Inspired Search
Xun Huang ⋅ Simeng Qin ⋅ Xiaoshuang Jia ⋅ Ranjie Duan ⋅ Huanqian Yan ⋅ Zhitao Zeng ⋅ Fei Yang ⋅ Yang Liu ⋅ Yang Liu
As Large Language Models (LLMs) are increasingly used, their security risks have drawn growing attention. Existing research reveals that LLMs are highly susceptible to jailbreak attacks, with effectiveness varying across language contexts. This paper investigates the role of classical Chinese in jailbreak attacks. Owing to its conciseness and obscurity, classical Chinese can partially bypass existing safety constraints, exposing notable vulnerabilities in LLMs. Based on this observation, this paper proposes a framework, CC-BOS, for the automatic generation of classical Chinese adversarial prompts based on multi-dimensional fruit fly optimization, facilitating efficient and automated jailbreak attacks in black-box settings. Prompts are encoded into eight policy dimensions (role, behavior, mechanism, metaphor, expression, knowledge, trigger pattern, and context) and iteratively refined via smell search, visual search, and Cauchy mutation. This design enables efficient exploration of the search space, thereby enhancing the effectiveness of black-box jailbreak attacks. To enhance readability and evaluation accuracy, we further design a classical Chinese to English translation module. Extensive experiments demonstrate the effectiveness of the proposed CC-BOS, which consistently outperforms state-of-the-art jailbreak attack methods.
Seesaw: Accelerating Training by Balancing Batch Size and Learning Rate Scheduling
Alexandru Meterez ⋅ Depen Morwani ⋅ Jingfeng Wu ⋅ Costin-Andrei Oncescu ⋅ Cengiz Pehlevan ⋅ Sham Kakade
Increasing the batch size during training (a “batch ramp”) is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal strategy for adaptive optimizers like Adam is less clear. As a result, any batch-ramp scheduling, if used at all, is typically tuned heuristically. This work develops a principled framework for batch-size scheduling and introduces Seesaw: whenever a standard scheduler would halve the learning rate, Seesaw instead multiplies it by $1/\sqrt{2}$ and doubles the batch size, preserving loss dynamics while reducing serial steps. Theoretically, we provide, to our knowledge, the first finite-sample proof of equivalence between learning-rate decay and batch-size ramp-up for SGD on noisy linear regression, and we extend this equivalence to normalized SGD, a tractable proxy for Adam, under a variance-dominated regime observed in practice. Empirically, on 150M/300M/600M-parameter models trained at Chinchilla scale using a constant (critical) batch size, Seesaw matches cosine decay at equal FLOPs while reducing wall-clock time by $\approx 36\%$, approaching the theoretical limit implied by our analysis.
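The update rule stated in the abstract can be written down directly. The sketch below assumes a step-style baseline with a known number of halving events; how a continuous schedule such as cosine decay is mapped onto halving events is not described in the abstract, and the function name and starting values are hypothetical.

```python
import math


def seesaw_schedule(base_lr, base_batch, num_decay_events):
    """Return (lr, batch_size) at the start and after each decay event.

    At every point where the baseline scheduler would halve the learning rate,
    the rule multiplies the learning rate by 1/sqrt(2) and doubles the batch size.
    """
    schedule = [(base_lr, base_batch)]
    lr, batch = base_lr, base_batch
    for _ in range(num_decay_events):
        lr *= 1.0 / math.sqrt(2.0)
        batch *= 2
        schedule.append((lr, batch))
    return schedule


for lr, batch in seesaw_schedule(base_lr=1e-3, base_batch=256, num_decay_events=4):
    print(f"lr={lr:.6f}  batch={batch}")
```

After two events the learning rate has dropped by a factor of two while the batch size has grown fourfold, so each later phase processes the same number of tokens in fewer serial optimizer steps.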
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
Xinhao Li ⋅ Yi Wang ⋅ Jiashuo Yu ⋅ Xiangyu Zeng ⋅ Yuhan Zhu ⋅ Haian Huang ⋅ Jianfei Gao ⋅ Kunchang Li ⋅ Yinan He ⋅ Chenting Wang ⋅ Yu Qiao ⋅ Yali Wang ⋅ Limin Wang
Long-context video modeling is critical for multimodal large language models (MLLMs), enabling them to process movies, online video streams, and so on. Despite recent advances, handling long videos remains challenging due to the difficulty of efficiently understanding extremely long video contexts. This paper addresses this issue through the model architecture, training data, training strategy, and evaluation benchmark. First, we propose a novel Hierarchical video token Compression (HiCo) method, which leverages visual redundancy in long videos to compress long video context from Clip-level to Video-level, reducing the computation significantly while preserving essential details and achieving an extreme compression ratio of approximately 1/50 with almost no performance loss. Second, we introduce a multi-stage short-to-long learning scheme, a large-scale dataset of real-world long videos named LongVid, and a challenging “Multi-Hop Needle-In-A-Video-Haystack” benchmark. Finally, we build a powerful video MLLM named VideoChat-Flash, which shows leading performance on both mainstream long and short video benchmarks at the 2B and 7B model scales. Among open-source models, it is the first to reach 99.1% accuracy over 10,000 frames in NIAH.
PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction For Continual Learning
Simon Yu ⋅ Gang Li ⋅ Weiyan Shi ⋅ Peng Qi
Large language models (LLMs) are moving beyond static uses and are now powering agents that learn during their interaction with external environments. For example, agents can learn reusable skills while navigating web pages or toggling new tools. However, existing methods for skill learning often create skills that are over-specialized to a single website and fail to generalize. We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills. The core idea, inspired by polymorphism in software engineering, is to decouple a skill's abstract goal (what it accomplishes) from its concrete implementation (how it is executed). Experiments show that our method (1) improves skill reuse by 1.7x on seen websites, (2) boosts success rates by up to 9.4\% on Mind2Web and 13.9\% on unseen websites, while reducing steps by over 20\%, and (3) in self-exploration settings without specified tasks, improves the quality of proposed tasks and enables agents to learn generalizable skills that work across different sites. By enabling the agent to identify and refine its own goals, PolySkill gives the agent a better curriculum, leading to the acquisition of more generalizable skills compared to baseline methods. Our findings show that separating a skill's goal from its execution is a crucial step toward developing autonomous agents that can learn and generalize across the open web continuously. Our code can be found at https://github.com/simonucl/PolySkill.
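The polymorphism analogy can be made concrete with a small sketch: an abstract skill declares the goal, and site-specific subclasses supply the steps. All class names, selectors, and the action-tuple format below are hypothetical and are not taken from the PolySkill codebase.

```python
from abc import ABC, abstractmethod


class SearchProduct(ABC):
    """Abstract skill: the goal (find a product) is fixed; the steps are site-specific."""

    @abstractmethod
    def actions(self, query):
        """Return the list of UI actions that carry out the search on one site."""


class ShopASearch(SearchProduct):
    def actions(self, query):
        return [("click", "#search-box"), ("type", query), ("press", "Enter")]


class ShopBSearch(SearchProduct):
    def actions(self, query):
        return [("click", "button.open-search"), ("type", query), ("click", "button.submit")]


def run_skill(skill, query):
    """The caller depends only on the abstract goal, not on any one site."""
    return skill.actions(query)


plan_a = run_skill(ShopASearch(), "usb cable")
plan_b = run_skill(ShopBSearch(), "usb cable")
print(plan_a)
print(plan_b)
```

Supporting a new website then means adding one more subclass, while any higher-level routine written against `SearchProduct` is reused unchanged.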
SR-Scientist: Scientific Equation Discovery With Agentic AI
Shijie Xia ⋅ Yuhan Sun ⋅ Pengfei Liu
Recently, Large Language Models (LLMs) have been applied to scientific equation discovery, leveraging their embedded scientific knowledge for hypothesis generation. However, current methods typically confine LLMs to the role of an equation proposer within search algorithms like genetic programming. In this paper, we present SR-Scientist, a framework that elevates the LLM from a simple equation proposer to an autonomous AI scientist that writes code to analyze data, implements the equation as code, submits it for evaluation, and optimizes the equation based on experimental feedback. Specifically, we wrap the code interpreter into a set of tools for data analysis and equation evaluation. The agent is instructed to optimize the equation by utilizing these tools over a long horizon with minimal human-defined pipelines. Empirical results show that SR-Scientist outperforms baseline methods by an absolute margin of 6\% to 35\% on datasets covering four science disciplines. Additionally, we demonstrate our method's robustness to noise, the generalization of the discovered equations to out-of-domain data, and their symbolic accuracy. Furthermore, we develop an end-to-end reinforcement learning framework to enhance the agent's capabilities.
Rolling Forcing: Autoregressive Long Video Diffusion in Real Time
Kunhao Liu ⋅ Wenbo Hu ⋅ Jiale Xu ⋅ Ying Shan ⋅ Shijian Lu
Streaming video generation, a fundamental component of interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation that often significantly degrades the generated video streams over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into the long-horizon stream video generation task, which allows the model to keep key–value states of initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
Tree-sliced Sobolev IPM
Viet-Hoang Tran ⋅ Thanh Q. Tran ⋅ Thanh Chu ⋅ Duy-Tung Pham ⋅ Trung-Khang Tran ⋅ Tam Le ⋅ Tan Nguyen
Recent work shows Tree-Sliced Optimal Transport to be an efficient and more expressive alternative to Sliced Wasserstein (SW), improving downstream performance. Tree-sliced metrics compare probability distributions by projecting measures onto tree metric spaces; a central example is the Tree-Sliced Wasserstein (TSW) distance, which applies the $1$-Wasserstein metric after projection. However, computing tree-based $p$-Wasserstein for general $p$ is costly, largely confining practical use to $p=1$. This restriction is a significant bottleneck, as higher-order metrics ($p > 1$) are preferred in gradient-based learning for their more favorable optimization landscapes. In this work, we revisit Sobolev integral probability metrics (IPM) on trees to obtain a practical generalization of TSW. Building on the insight that a suitably regularized Sobolev IPM admits a closed-form expression, we introduce TS-Sobolev, a tree-sliced metric that aggregates regularized Sobolev IPMs over random tree systems and remains tractable for all $p \ge 1$; for $p>1$, TS-Sobolev has the same computational complexity as TSW at $p=1$. Notably, at $p=1$ it recovers TSW exactly. Consequently, TS-Sobolev serves as a drop-in replacement for TSW in practical applications, with an additional flexibility in changing $p$. Furthermore, we extend this framework to define a corresponding metric for probability measures on hyperspheres. Experiments on Euclidean and spherical datasets show that TS-Sobolev and its spherical variant improve downstream performance in gradient flows, self-supervised learning, generative modeling, and text topic modeling over recent SW and TSW variants. Our code is available at [https://github.com/thanhquangtran/TS-Sobolev](https://github.com/thanhquangtran/TS-Sobolev).
Hierarchical Encoding Tree with Modality Mixup for Cross-modal Hashing
Zhiping Xiao ⋅ Junyu Luo ⋅ Hang Zhou ⋅ Yusheng Zhao ⋅ Xiao Luo ⋅ Pengyun Wang ⋅ Wei Ju ⋅ Siyu Heng ⋅ Ming Zhang
Cross-modal retrieval is a fundamental task that aims to learn semantic correspondences across different data modalities, such as visual and textual modalities. Unsupervised hashing methods can efficiently manage large-scale data and can be effectively applied to cross-modal retrieval studies. However, existing methods typically fail to fully exploit the hierarchical semantic structure within text and image data, where instances naturally organize into multi-level communities of varying granularity. Moreover, the commonly used direct modal alignment cannot effectively bridge the semantic gap between these two modalities. To address these issues, we introduce a novel Hierarchical Encoding Tree with Modality Mixup (HINT) method, which achieves effective cross-modal retrieval by extracting hierarchical cross-modal relations. HINT constructs a cross-modal encoding tree guided by hierarchical structural entropy and generates proxy samples of text and image modalities for each instance from the encoding tree. Through the curriculum-based mixup of proxy samples, HINT achieves progressive modal alignment and effective cross-modal retrieval. We also conduct cross-modal consistency learning to achieve global-view semantic alignment between text and image representations. Extensive experiments on a range of cross-modal retrieval datasets demonstrate the superiority of HINT over state-of-the-art methods.
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
Haonian Ji ⋅ Shi Qiu ⋅ Siyang Xin ⋅ Siwei Han ⋅ Zhaorun Chen ⋅ Dake Zhang ⋅ Hongyi Wang ⋅ Huaxiu Yao
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations.
BaseReward: A Strong Baseline for Multimodal Reward Model
YiFan Zhang ⋅ HaiHuaYang ⋅ Huanyu Zhang ⋅ Yang Shi ⋅ Zezhou Chen ⋅ Haochen Tian ⋅ Chaoyou Fu ⋅ Kai WU ⋅ Bo Cui ⋅ Xu Wang ⋅ Jianfei Pan ⋅ Haotian Wang ⋅ Zhang Zhang ⋅ Liang Wang
The rapid advancement of Multimodal Large Language Models (MLLMs) has made aligning them with human preferences a critical challenge. Reward Models (RMs) are a core technology for achieving this goal, but a systematic guide for building state-of-the-art Multimodal Reward Models (MRMs) is currently lacking in both academia and industry. Through exhaustive experimental analysis, this paper aims to provide a clear “recipe” for constructing high-performance MRMs. We systematically investigate every crucial component in the MRM development pipeline, including reward modeling paradigms (e.g., Naive-RM, Critic-based RM, and Generative RM), reward head architecture, training strategies, data curation (covering over ten multimodal and text-only preference datasets), backbone model and model scale, and ensemble methods. Based on these experimental insights, we introduce BaseReward, a powerful and efficient baseline for multimodal reward modeling. BaseReward adopts a simple yet effective architecture, built upon a Qwen2.5-VL backbone, featuring an optimized two-layer reward head, and is trained on a carefully curated mixture of high-quality multimodal and text-only preference data. Our results show that BaseReward establishes a new state-of-the-art (SOTA) on major benchmarks such as MM-RLHF-Reward Bench, VL-Reward Bench, and Multimodal Reward Bench, outperforming previous open-source and proprietary models. Furthermore, to validate its practical utility beyond static benchmarks, we integrate BaseReward into a real-world reinforcement learning pipeline, successfully enhancing an MLLM’s performance across various perception, reasoning, and conversational tasks. This work not only delivers a top-tier MRM but, more importantly, provides the community with a clear, empirically backed guide for developing robust reward models for the next generation of MLLMs.
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
Zhenghai Xue ⋅ Longtao Zheng ⋅ Qian Liu ⋅ Yingru Li ⋅ Xiaosen Zheng ⋅ Zejun MA ⋅ Bo An
Large Language Models (LLMs) can enhance their reasoning by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn settings using Reinforcement Learning (RL) often exhibits training instability and degraded performance. We attribute the instability to harmful negative samples resulting from distributional drift and compounding errors induced by using external tool outputs during multi-turn rollout. To address this issue, we introduce SimpleTIR, a simple method that stabilizes multi-turn TIR training via filtering out trajectories with "void turns", i.e., turns that yield neither a code block nor a final answer. Specifically, we remove those trajectories from the policy update to block harmful gradients, while retaining them in advantage estimation to keep the estimate unbiased. Extensive experiments show that SimpleTIR effectively mitigates gradient norm explosion and stabilizes multi-turn RL training from base models. It achieves state-of-the-art performance on challenging math reasoning benchmarks, including an AIME24 score of 50.5 starting from the Qwen2.5-7B base model. SimpleTIR also promotes more diverse reasoning behaviors such as self-correction and cross-validation, outperforming prior methods trained from stronger instruction-tuned models.
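The filtering rule in this abstract (drop void-turn trajectories from the policy update but keep them when estimating advantages) can be sketched in a few lines. This is an illustration rather than the SimpleTIR implementation: the `Turn` and `Trajectory` types, the group-mean baseline, and the simple policy-gradient loss are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class Turn:
    has_code_block: bool
    has_final_answer: bool


@dataclass
class Trajectory:
    turns: List[Turn]
    reward: float


def has_void_turn(traj: Trajectory) -> bool:
    """A void turn yields neither a code block nor a final answer."""
    return any(not (t.has_code_block or t.has_final_answer) for t in traj.turns)


def advantages_and_mask(group: List[Trajectory]) -> Tuple[List[float], List[float]]:
    """Advantages use ALL trajectories; the mask zeroes out void-turn ones.

    Assumption: a group-mean reward baseline stands in for whatever
    advantage estimator the training setup actually uses.
    """
    baseline = mean(t.reward for t in group)
    advantages = [t.reward - baseline for t in group]
    mask = [0.0 if has_void_turn(t) else 1.0 for t in group]
    return advantages, mask


def masked_policy_loss(log_probs: List[float], advantages: List[float],
                       mask: List[float]) -> float:
    """Policy-gradient loss averaged over the unmasked trajectories only."""
    kept = max(sum(mask), 1.0)
    total = sum(m * a * lp for m, a, lp in zip(mask, advantages, log_probs))
    return -total / kept
```

In this sketch a masked trajectory still shifts the baseline for its group, so the remaining trajectories see the same advantages they would have seen without filtering; only its own gradient contribution is removed.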
Taming Curvature: Architecture Warm-up for Stable Transformer Training
Sameera Ramasinghe ⋅ Thalaiyasingam Ajanthan ⋅ Hadi Mohaghegh Dolatabadi ⋅ Chamin Hewa Koneputugodage ⋅ Gil Avraham ⋅ Violetta Shevchenko ⋅ Yan Zuo ⋅ Karol Pajak ⋅ Alexander Long
Training billion-parameter Transformers is often brittle, with transient loss spikes and divergence that waste compute. Even though the recently developed Edge of Stability (EoS) theory provides a powerful tool to understand and control the stability of optimization methods via the (preconditioned) curvature, these curvature-controlling methods are not popular in large-scale Transformer training due to the complexity of curvature estimation. To this end, we first introduce a fast online estimator of the largest (preconditioned) Hessian eigenvalue (i.e., curvature) based on a warm-started variant of power iteration with Hessian–vector products. We show theoretically, and verify empirically, that the proposed method makes per-iteration curvature tracking feasible at billion-parameter scale while being more accurate. Using this tool, we find that training instabilities coincide with surges in preconditioned curvature and that curvature grows with depth. Motivated by these observations, we propose architecture warm-up: progressively growing network depth to carefully control the preconditioned Hessian and stabilize training. Experiments on large Transformers validate that our approach enables efficient curvature tracking and reduces instabilities compared to existing state-of-the-art stabilization techniques without slowing down convergence.
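To make the warm-starting idea concrete, here is a generic power-iteration sketch that only needs a matrix-vector product, with the previous eigenvector estimate reused as the starting point at the next training step. It is not the authors' estimator: the explicit small matrix behind `make_hvp` is a stand-in for a real Hessian-vector product, preconditioning is omitted, and all function names are illustrative.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def make_hvp(matrix: List[Vector]) -> Callable[[Vector], Vector]:
    """Stand-in for a Hessian-vector product, using an explicit symmetric matrix."""
    def hvp(v: Vector) -> Vector:
        return [sum(m * x for m, x in zip(row, v)) for row in matrix]
    return hvp


def power_iteration(hvp: Callable[[Vector], Vector], v0: Vector,
                    num_steps: int) -> Tuple[float, Vector]:
    """Estimate the largest-magnitude eigenvalue using only hvp calls.

    Passing the eigenvector estimate from the previous training step as
    v0 is the warm start: if the matrix changed little, very few steps
    are needed. Assumes hvp never returns the zero vector.
    """
    v = normalize(v0)
    eig = 0.0
    for _ in range(num_steps):
        hv = hvp(v)
        eig = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient at v
        v = normalize(hv)
    return eig, v
```

A cold start from a generic vector may need many products, whereas a warm start after a small change in the matrix can get close with a single one, which is what makes per-iteration tracking affordable in this simplified setting.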
Scaling Agents via Continual Pre-training
Liangcai Su ⋅ Zhen Zhang ⋅ Guangyu Li ⋅ Zhuo Chen ⋅ Chenxi Wang ⋅ Maojia Song ⋅ Xinyu Wang ⋅ Kuan Li ⋅ Jialong Wu ⋅ Xuanzhong Chen ⋅ Zile Qiao ⋅ Zhongwang Zhang ⋅ Huifeng Yin ⋅ Shihao Cai ⋅ Runnan Fang ⋅ Zhengwei Tao ⋅ Wenbiao Yin ⋅ Rui Ye ⋅ Yong Jiang ⋅ Ningyu Zhang ⋅ Pengjun Xie ⋅ Fei Huang ⋅ Kai Ye ⋅ Kewei Tu ⋅ Chenxiong Qian ⋅ Jingren Zhou
Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agent training pipeline to build powerful agentic foundation models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
LeRobot: An Open-Source Library for End-to-End Robot Learning
Remi Cadene ⋅ Simon Alibert ⋅ Francesco Capuano ⋅ Michel Aractingi ⋅ Adil Zouitine ⋅ Pepijn Kooijmans ⋅ Jade Choghari ⋅ Martino Russi ⋅ Caroline Pascal ⋅ Steven Palma ⋅ Dana Aubakirova ⋅ Mustafa Shukor ⋅ Jess Moss ⋅ Alexander Soare ⋅ Quentin Lhoest ⋅ Quentin Gallouédec ⋅ Thomas Wolf
Robotics is undergoing a significant transformation powered by advances in high-level control techniques based on machine learning, giving rise to the field of robot learning. Recent progress in robot learning has been accelerated by the increasing availability of affordable teleoperation systems, large-scale openly available datasets, and scalable learning-based methods. However, development in the field of robot learning is often slowed by fragmented, closed-source tools designed to only address specific sub-components within the robotics stack. In this paper, we present lerobot, an open-source library that integrates across the entire robotics stack, from low-level middleware communication for motor controls to large-scale dataset collection, storage and streaming. The library is designed with a strong focus on real-world robotics, supporting accessible hardware platforms while remaining extensible to new embodiments. It also supports efficient implementations for various state-of-the-art robot learning algorithms from multiple prominent paradigms, as well as a generalized asynchronous inference stack. Unlike traditional pipelines which heavily rely on hand-crafted techniques, lerobot emphasizes scalable learning approaches that improve directly with more data and compute. Designed for accessibility, scalability, and openness, lerobot lowers the barrier to entry for researchers and practitioners to robotics while providing a platform for reproducible, state-of-the-art robot learning.
Q&C: When Quantization Meets Cache in Efficient Generation
Xin Ding ⋅ Xin Li ⋅ Haotong Qin ⋅ Zhibo Chen
Quantization and cache mechanisms are typically applied individually in efficient generation tasks, each showing notable potential for acceleration. However, their joint effect on efficiency remains under-explored. Through both empirical investigation and theoretical analysis, we find that combining quantization with caching is non-trivial, as it introduces two major challenges that severely degrade performance: (i) the sample efficacy of calibration datasets in post-training quantization (PTQ) is largely eliminated by the cache operation; (ii) the joint use of the two mechanisms exacerbates exposure bias in the sampling distribution, leading to amplified error accumulation during generation. In this work, we take advantage of these two acceleration mechanisms and propose a hybrid acceleration method that tackles the above challenges, aiming to further improve the efficiency of tasks while maintaining excellent generation capability. Concretely, a temporal-aware parallel clustering (TAP) is designed to dynamically improve the sample selection efficacy for the calibration within PTQ for different diffusion steps. A variance compensation (VC) strategy is derived to correct the sampling distribution. It mitigates exposure bias through adaptive correction-factor generation. Extensive experiments demonstrate that our method is broadly applicable to diverse generation tasks, achieving up to 12.7x acceleration while preserving competitive generation quality.
CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
Yang Zhao ⋅ Chengxiao Dai ⋅ Wei Zhuo ⋅ Yue Xiu ⋅ Dusit Niyato
Knowledge graphs provide structured context for multi‑hop question answering, but deployed systems must balance answer accuracy with strict latency and cost targets while preserving provenance. Static $k$‑hop expansions and "think‑longer" prompting often over‑retrieve, inflate context, and yield unpredictable runtime. Thus, we introduce CLAUSE, an agentic three-agent neuro‑symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to keep, and when to stop. Latency (interaction steps) and prompt cost (selected tokens) are exposed as user‑specified budgets or prices, allowing per‑query adaptation to trade‑offs among accuracy, latency, and cost without retraining. CLAUSE employs the proposed Lagrangian‑Constrained Multi‑Agent Proximal Policy Optimization (LC‑MAPPO) algorithm to coordinate three agents: Subgraph Architect, Path Navigator, and Context Curator, so that subgraph construction, reasoning-path discovery, and evidence selection are jointly optimized under per‑query resource budgets on edge edits, interaction steps, and selected tokens. Across HotpotQA, MetaQA, and FactKG, CLAUSE yields higher EM@1 while reducing subgraph growth and end-to-end latency at equal or lower token budgets. On MetaQA-2-hop, relative to the strongest RAG baseline (GraphRAG), CLAUSE achieves +39.3 EM@1 with 18.6% lower latency and 40.9% lower edge growth. The resulting contexts are compact, provenance‑preserving, and deliver predictable performance under deployment constraints.
FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning
LIANG HU ⋅ Jianpeng Jiao ⋅ Jiashuo Liu ⋅ Dongyuan Mutu ⋅ Yanle Ren ⋅ Zhoufutu Wen ⋅ Kaiyuan Zhang ⋅ Xuanliang Zhang ⋅ Xiang Gao ⋅ Tianci He ⋅ FEI HU ⋅ Yali Liao ⋅ Zaiyuan Wang ⋅ Jingkai Liu ⋅ Sun Daibin ⋅ Ziqing Zeng ⋅ zhiyuan zeng ⋅ Chenghao Yang ⋅ Qianyu Yang ⋅ Mingren Yin ⋅ Ge Zhang ⋅ Xinyi zhang ⋅ Xiying ZHAO ⋅ Zhu Zhenwei ⋅ Hongseok Namkoong ⋅ Wenhao Huang
Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial datasets evaluate the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks, Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation, closely reproducing real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy. DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly affects performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.
CoDA: From Text-to-Image Diffusion Models to Training-Free Dataset Distillation
Letian Zhou ⋅ Songhua Liu ⋅ Xinchao Wang
Prevailing Dataset Distillation (DD) methods leveraging generative models confront two fundamental limitations. First, despite pioneering the use of diffusion models in DD and delivering impressive performance, the vast majority of approaches paradoxically require a diffusion model pre-trained on the full target dataset, undermining the very purpose of DD and incurring prohibitive training costs. Second, although some methods turn to general text-to-image models without relying on such target-specific training, they suffer from a significant distributional mismatch, as the web-scale priors encapsulated in these foundation models fail to faithfully capture the target-specific semantics, leading to suboptimal performance. To tackle these challenges, we propose Core Distribution Alignment (CoDA), a framework that enables effective DD using only an off-the-shelf text-to-image model. Our key idea is to first identify the "intrinsic core distribution" of the target dataset using a robust density-based discovery mechanism. We then steer the generative process to align the generated samples with this core distribution. By doing so, CoDA effectively bridges the gap between general-purpose generative priors and target semantics, yielding highly representative distilled datasets. Extensive experiments suggest that, without relying on a generative model specifically trained on the target dataset, CoDA achieves performance on par with or even superior to previous methods with such reliance across all benchmarks, including ImageNet-1K and its subsets. Notably, it establishes a new state-of-the-art accuracy of 60.4% at the 50-images-per-class (IPC) setup on ImageNet-1K. Our code is available on the project webpage: https://github.com/zzzlt422/CoDA
DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
Xiang Liu ⋅ Xuming Hu ⋅ Xiaowen Chu ⋅ Eunsol Choi
Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. We conduct a systematic analysis across models and datasets and discover a U-shaped entropy pattern: high entropy on simple problems despite high accuracy, low entropy on medium difficulty, and high entropy on hard problems reflecting uncertainty. The 22–25% entropy reduction from simple to optimal regions reveals a fundamental inefficiency: an overthinking phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight, deployment-ready framework that predicts problem difficulty from hidden states and selects among Easy/Normal/Hard reasoning strategies to allocate computation adaptively. DiffAdapt requires no retraining of the base LLM and is compatible with common inference optimizations. Across five models and eight benchmarks, DiffAdapt achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
R-WoM: Retrieval-augmented World Model For Computer-use Agents
Kai Mei ⋅ Jiang Guo ⋅ Shuaichen Chang ⋅ Mingwen Dong ⋅ Dongkyu Lee ⋅ Xing Niu ⋅ Jiarong Jiang
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency to hallucinate and their reliance on static training knowledge, which could lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models, future state prediction and reward estimation, through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves relative improvements of up to 23.4% and 16.3% on the subsets of OSWorld and WebArena compared to baselines, with particular advantage in longer-horizon simulations.
TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Yidong Wang ⋅ Yunze Song ⋅ Tingyuan Zhu ⋅ Xuanwang Zhang ⋅ Zhuohao Yu ⋅ Hao Chen ⋅ Chiyu Song ⋅ Qiufeng Wang ⋅ Zhen Wu ⋅ Xin-Yu Dai ⋅ Yue Zhang ⋅ Cunxiang Wang ⋅ Wei Ye ⋅ Shikun Zhang
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains ($A > B > C > A$) and equivalence contradictions ($A = B = C \neq A$). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: (1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and (2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43 percentage points (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82 percentage points (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations.
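The contrast between a discrete rating and a distribution-sensitive score can be shown with a small calculation. This is a sketch of the general idea (taking the expectation over rating probabilities instead of the single most likely rating), not TrustJudge's code; the rating scale and the probability values are made up for illustration.

```python
from typing import Dict


def discrete_score(probs: Dict[int, float]) -> int:
    """Conventional scoring: keep only the most probable rating."""
    return max(probs, key=probs.get)


def expected_score(probs: Dict[int, float]) -> float:
    """Distribution-sensitive scoring: expectation over rating probabilities."""
    total = sum(probs.values())  # renormalize in case the probabilities are truncated
    return sum(rating * p for rating, p in probs.items()) / total


# Hypothetical judge output: probability assigned to each rating on a 1-5 scale.
response_a = {3: 0.45, 4: 0.55}
response_b = {4: 0.51, 5: 0.49}
```

Both hypothetical responses receive the discrete rating 4, so a discrete judge reports a tie, while the expected scores (about 3.55 and 4.49) preserve the information that the judge leaned lower for the first response and higher for the second.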
Joint Distribution–Informed Shapley Values for Sparse Counterfactual Explanations
Lei You ⋅ Yijun Bian ⋅ Lele Cao
Counterfactual explanations (CE) aim to reveal how small input changes flip a model's prediction, yet many methods modify more features than necessary, reducing clarity and actionability. We introduce COLA, a model- and generator-agnostic post-hoc framework that refines any given CE by computing a coupling via optimal transport (OT) between factual and counterfactual sets and using it to drive a Shapley-based attribution, p-SHAP, that selects a minimal set of edits while preserving the target effect. Theoretically, we show that OT minimizes an upper bound on the $W_1$ divergence between factual and counterfactual outcomes and that, under mild conditions, refined counterfactuals are guaranteed not to move farther from the factuals than the originals. Empirically, across four datasets, twelve models, and five CE generators, COLA achieves the same target effects with only 26–45% of the original feature edits. On a small-scale benchmark, COLA shows near-optimality.
GTA1: GUI Test-time Scaling Agent
Yan Yang ⋅ Dongxu Li ⋅ Yutong Dai ⋅ Yuhao Yang ⋅ Ziyang Luo ⋅ Zirui Zhao ⋅ Zhiyuan Hu ⋅ Junzhe Huang ⋅ Amrita Saha ⋅ Zeyuan Chen ⋅ Ran Xu ⋅ Liyuan Pan ⋅ Caiming Xiong ⋅ Junnan Li
Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., the action proposal sequence) under an expansive action space, where selecting an appropriate plan is non-trivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates the aforementioned challenges with our GUI Test-time Scaling Agent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled, and a judge model evaluates them and selects one. It trades off computation for better decision quality by concurrent sampling. Second, we propose a model that improves grounding of the selected action proposals to their corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks.
Frozen Priors, Fluid Forecasts: Prequential Uncertainty for Low-Data Deployment with Pretrained Generative Models
Fernando Ruiz-Mazo ⋅ Vikas Garg
Deploying ML systems with only a few real samples makes operational metrics (such as alert rates or mean scores) highly unstable. Existing uncertainty quantification (UQ) methods fail here: frequentist intervals ignore the deployed predictive rule, Bayesian posteriors assume continual refitting, and conformal methods offer per-example rather than long-run guarantees. We introduce a forecast-first UQ framework that blends the empirical distribution with a frozen pretrained generator using a unique Dirichlet schedule, ensuring time-consistent forecasts. Uncertainty is quantified via martingale posteriors: a lightweight, likelihood-free resampling method that simulates future forecasts under the deployed rule, yielding sharp, well-calibrated intervals for both current and long-run metrics without retraining or density evaluation. A single hyperparameter, set by a small-$n$ minimax criterion, balances sampling variance and model-data mismatch; for bounded scores, we provide finite-time drift guarantees. We also show how this framework informs optimal retraining decisions. Applicable off-the-shelf to frozen generators (flows, diffusion, autoregressive models, GANs) and linear metrics (means, tails, NLL), it outperforms bootstrap baselines across vision and language benchmarks (WikiText-2, CIFAR-10, and SVHN datasets); e.g., it achieves about 90% coverage on GPT-2 with 20 samples vs. 37% for bootstrap. Importantly, our uncertainty estimates are operational under the deployed forecasting rule agnostic of the population parameters, affording practicable estimators for deployment in real-world settings. Code available at https://github.com/Aalto-QuML/Prequential/.
Efficient Testing for Correlation Clustering: Improved Algorithms and Optimal Bounds
Chengyuan Deng ⋅ Jie Gao ⋅ Songhua He ⋅ Chen Wang
Correlation clustering is an important unsupervised learning problem with broad applications. In this problem, we are given a labeled complete graph $G=(V,E^+ \cup E^-)$, and the optimal clustering is defined as a partition of the vertices that minimizes the number of $+$ edges between clusters and $-$ edges within clusters. We investigate efficient algorithms to test the \emph{cost} of correlation clustering: here, we want to know whether the graph can be perfectly clustered (with $0$ cost) or is far from admitting any perfect clustering. The problem has attracted significant attention aimed at modern large-scale applications, and the state-of-the-art results use $\widetilde{O}({1}/{\varepsilon^7})$ queries and time (up to log factors) to decide whether a graph is perfectly clusterable or requires flipping the labels of $\varepsilon {\binom n 2}$ edges to become clusterable. In this paper, we improve this bound significantly by designing an algorithm that uses ${O}({1}/{\varepsilon^2})$ queries and time. Furthermore, we derive the first algorithm that tests the cost for the special setting of correlation clustering with $k$ clusters with ${O}(1/{\varepsilon^4})$ queries and time for constant $k$. Finally, for the special case of $k=2$, which corresponds to the strong structural balance problem in social networks, we obtain tight bounds of $\Theta({1}/{\varepsilon})$ queries -- the first set of \emph{tight} bounds in these problems. We conduct experiments on simulated and real-world datasets, and empirical results demonstrate the advantages of our algorithms.
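As a small illustration of the objective and of what a query-based test looks for, the sketch below computes the disagreement cost of a clustering and samples random triples in search of a "bad triangle" (two $+$ edges and one $-$ edge). A complete signed graph admits a zero-cost clustering exactly when it has no such triangle. This naive sampler is only an illustration and is not the algorithm from the paper; the function names and the edge encoding are assumptions.

```python
import random


def disagreement_cost(sign, clusters):
    """Count + edges across clusters plus - edges inside clusters.

    sign:     dict mapping (u, v) with u < v to +1 or -1
    clusters: dict mapping each vertex to a cluster label
    """
    cost = 0
    for (u, v), s in sign.items():
        same_cluster = clusters[u] == clusters[v]
        if (s > 0 and not same_cluster) or (s < 0 and same_cluster):
            cost += 1
    return cost


def is_bad_triangle(sign, a, b, c):
    """True when the triangle has exactly two + edges and one - edge."""
    labels = [sign[tuple(sorted(pair))] for pair in ((a, b), (b, c), (a, c))]
    return labels.count(-1) == 1


def sample_for_bad_triangle(sign, vertices, num_samples, seed=0):
    """Query random triples; return a bad triangle as a witness, or None."""
    rng = random.Random(seed)
    for _ in range(num_samples):
        a, b, c = rng.sample(vertices, 3)
        if is_bad_triangle(sign, a, b, c):
            return (a, b, c)
    return None
```

Each sampled triple costs three edge queries, so the number of samples directly controls the query budget of this toy tester.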
Progress in reinforcement learning research has made learning in high-dimensional MDPs with complex state dynamics possible. At the same time, deep neural policies have been observed to be highly unstable with respect to minor variations in their state space, causing volatile and unpredictable behaviour. To alleviate this volatility, a line of work has proposed explicitly regularizing the temporal difference loss to ensure $\epsilon$-local invariance in the state space. In this paper, we provide theoretical foundations on the impact of robust, i.e., adversarial, training on reinforcement learning. Our comprehensive theoretical and experimental analysis reveals that standard reinforcement learning inherently learns counterfactual values, while recent training techniques that explicitly enforce $\epsilon$-local invariance cause policies to lose counterfactuality and further result in learning misaligned and inconsistent values. In connection with this analysis, we further highlight that this line of training methods breaks the core intuition and the true biological inspiration of reinforcement learning, sacrifices essential inherent skills that enable reasoning and generalization, and introduces an intrinsic gap between how natural intelligence understands and interacts with an environment and how AI agents trained via $\epsilon$-local invariance methods do. The misalignment, inaccuracy, and loss of counterfactuality revealed in our paper further demonstrate the need to rethink the approach to establishing truly reliable and generalizable reinforcement learning policies.
Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality
Ryotaro Kawata ⋅ Taiji Suzuki
Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over tokens and viewing attention as an integral operator on measures. Concretely, for mixture contexts $\nu = I^{-1} \sum_{i=1}^I \mu^{(i)}$ and a query $x_{\mathrm{q}}(i^*)$, the task decomposes into (i) recall of the relevant component $\mu^{(i^*)}$ and (ii) prediction from $(\mu^{(i^*)},x_{\mathrm{q}})$. We study learned softmax attention (not a frozen kernel) trained by empirical risk minimization and show that a shallow measure-theoretic Transformer composed with an MLP learns the recall-and-predict map under a spectral assumption on the input densities. We further establish a matching minimax lower bound with the same rate exponent (up to multiplicative constants), proving sharpness of the convergence order. The framework offers a principled recipe for designing and analyzing Transformers that recall from arbitrarily long, distributional contexts with provable generalization guarantees.
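One common way to write softmax attention as an operator on a context measure $\nu$ is sketched below. The matrices $Q$, $K$, $V$ and the exact parameterization are assumptions made for illustration; the paper may define its operator differently. The second display shows that, for an empirical measure over $n$ tokens, the integral form reduces to ordinary softmax attention.

```latex
% Softmax attention acting on a context measure \nu (Q, K, V are assumed notation)
\[
\mathrm{Attn}(x_{\mathrm{q}}; \nu)
  = \frac{\int \exp\!\big(\langle Q x_{\mathrm{q}}, K y \rangle\big)\, V y \,\mathrm{d}\nu(y)}
         {\int \exp\!\big(\langle Q x_{\mathrm{q}}, K y' \rangle\big)\,\mathrm{d}\nu(y')}
\]
% Special case: \nu = n^{-1} \sum_{j=1}^{n} \delta_{y_j} (a finite token sequence)
\[
\mathrm{Attn}(x_{\mathrm{q}}; \nu)
  = \sum_{j=1}^{n}
    \frac{\exp\!\big(\langle Q x_{\mathrm{q}}, K y_j \rangle\big)}
         {\sum_{l=1}^{n} \exp\!\big(\langle Q x_{\mathrm{q}}, K y_l \rangle\big)}\, V y_j
\]
```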
Conformalized Survival Counterfactuals Prediction for General Right-Censored Data
Sijie Ren ⋅ Meng Yan ⋅ Zhen Zhang ⋅ Xu Yinghui ⋅ Xinwei Sun
This paper aims to develop a lower prediction bound (LPB) for survival time across different treatments in the general right-censored setting. Although previous methods have utilized conformal prediction to construct the LPB, their resulting prediction sets provide only probably approximately correct (PAC)–type miscoverage guarantees rather than exact ones. To address this problem, we propose a new calibration procedure under the potential outcome framework. Under the strong ignorability assumption, we propose a reweighting scheme that can transform the problem into a weighted conformal inference problem, allowing an LPB to be obtained via quantile regression with an exact miscoverage guarantee. Furthermore, our procedure is doubly robust against model misspecification. Empirical evaluations on synthetic and real-world clinical data demonstrate the validity and informativeness of our constructed LPBs, which indicate the potential of our analytical benchmark for comparing and selecting personalized treatments.
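The weighted conformal step mentioned in this abstract can be sketched generically as follows. The sketch assumes calibration scores of the form $\hat q(X_i) - T_i$ (predicted lower quantile minus observed survival time) and takes the weights as given; in the paper's setting they would come from the proposed reweighting scheme, which is not reproduced here. The function name and arguments are hypothetical.

```python
import numpy as np


def weighted_conformal_lpb(pred_quantile_test, cal_scores, cal_weights,
                           test_weight, alpha=0.1):
    """Lower prediction bound from a weighted conformal quantile.

    pred_quantile_test: predicted lower quantile for the test point
    cal_scores:  calibration scores, predicted quantile minus observed time
    cal_weights: non-negative weights for the calibration points (assumed given)
    test_weight: weight of the test point, whose score is treated as +infinity
    """
    scores = np.asarray(cal_scores, dtype=float)
    weights = np.asarray(cal_weights, dtype=float)
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    # Normalize over calibration points plus the test point's own mass.
    cumulative = np.cumsum(weights) / (weights.sum() + test_weight)
    # First index whose cumulative mass reaches the 1 - alpha level.
    index = np.searchsorted(cumulative, 1.0 - alpha, side="left")
    if index >= len(scores):
        return 0.0  # quantile is +infinity, so only the trivial bound remains
    # Survival times are non-negative, so clip the bound at zero.
    return max(pred_quantile_test - scores[index], 0.0)
```

When the test point carries most of the mass, the quantile falls on the point at infinity and the function falls back to the trivial bound of zero.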
Cartridges: Lightweight and general-purpose long context representations via self-study
Sabri Eyuboglu ⋅ Ryan Ehrlich ⋅ Simran Arora ⋅ Neel Guha ⋅ Dylan Zinsley ⋅ Emily Liu ⋅ Atri Rudra ⋅ James Y Zou ⋅ Azalia Mirhoseini ⋅ Christopher Re
Large language models are often used to answer queries grounded in large text corpora (e.g. codebases, legal documents, or chat histories) by placing the entire corpus in the context window and leveraging in-context learning (ICL). Although current models support contexts of 100K-10M tokens, this setup is costly to serve because the memory consumption of the KV cache scales with input length. We explore an alternative: training a smaller KV cache offline on each corpus. At inference time, we load this trained KV cache, which we call a Cartridge, and decode a response. Critically, the cost of training a Cartridge can be amortized across all the queries referencing the same corpus. However, we find that the naive approach of training the Cartridge with next-token prediction on the corpus is not competitive with ICL. Instead, we propose self-study, a training recipe in which we generate synthetic conversations about the corpus and train the Cartridge with a context-distillation objective. We find that Cartridges trained with self-study replicate the functionality of ICL, while being significantly cheaper to serve. On challenging long-context benchmarks, Cartridges trained with self-study match ICL performance while using 38.6x less memory and enabling 26.4x higher throughput. Self-study also extends the model's effective context length (e.g. from 128k to 484k tokens on MTOB) and surprisingly, leads to Cartridges that can be composed at inference time without retraining.
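To see why serving long contexts is memory-hungry, the short calculation below estimates the KV cache size for a standard Transformer decoder, which grows linearly with sequence length. The model configuration in the usage note (32 layers, 8 key-value heads, head dimension 128, 16-bit values) is a hypothetical example, not a configuration taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def to_gib(num_bytes):
    """Convert bytes to gibibytes."""
    return num_bytes / 2**30
```

For the hypothetical configuration, `kv_cache_bytes(32, 8, 128, 1)` is 131072 bytes (128 KiB) per token, so a 128,000-token context needs about 15.6 GiB for a single sequence; a trained cache with fewer key-value slots shrinks this cost proportionally.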
Newton Method Revisited: Global Convergence Rates up to $O(1/k^3)$ for Stepsize Schedules and Linesearch Procedures
Slavomir Hanzely ⋅ Farshed Abdukhakimov ⋅ Martin Takáč
This paper investigates the global convergence of stepsized Newton methods for convex functions with Hölder continuous Hessians or third derivatives. We propose several simple stepsize schedules with fast global convergence guarantees, up to $\mathcal O( k^{-3} )$. For cases with multiple plausible smoothness parameterizations or an unknown smoothness constant, we introduce a stepsize linesearch and a backtracking procedure with provable convergence as if the optimal smoothness parameters were known in advance. Additionally, we present strong convergence guarantees for the practically popular Newton method with exact linesearch.
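For readers unfamiliar with stepsized Newton methods, the update is $x_{k+1} = x_k - \alpha_k [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$. The sketch below chooses $\alpha_k$ with a generic Armijo backtracking rule; it is a textbook variant given for illustration, not the schedules or linesearch procedures proposed in the paper, and the constants `shrink` and `c` are conventional assumed defaults.

```python
import numpy as np


def damped_newton(f, grad, hess, x0, max_iter=50, tol=1e-10, shrink=0.5, c=1e-4):
    """Newton's method with a backtracking (Armijo) stepsize."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Newton direction: solve H d = g, then move along -d.
        direction = np.linalg.solve(hess(x), g)
        step = 1.0
        # Shrink the step until the sufficient-decrease condition holds.
        while step > 1e-12 and f(x - step * direction) > f(x) - c * step * (g @ direction):
            step *= shrink
        x = x - step * direction
    return x


# Example: f(x) = sum_i log(exp(x_i) + exp(-x_i)) is convex with minimizer 0,
# yet Newton with a fixed unit step diverges when started far from 0.
def f(x):
    return np.logaddexp(x, -x).sum()


def grad(x):
    return np.tanh(x)


def hess(x):
    return np.diag(1.0 - np.tanh(x) ** 2)
```

Calling `damped_newton(f, grad, hess, [3.0, -2.0])` returns a point close to the origin, whereas the undamped iteration from the same start overshoots on the first step.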
We document the rise of the Generative AI Archaeologist, whose tools include linear algebra and probability theory, jailbreaking, and debuggers, compared to the metal detectors, pickaxes, and radar surveys of traditional archaeology. GenAI Archaeologists have reported findings both through luck by observing unexpected behaviour in publicly accessible models, and by exploiting the mathematical properties of models. In this blog, we survey five types of findings unearthed by GenAI Archaeologists and discuss the status of those findings.