Journal Track Posters
NeoBERT: A Next Generation BERT
Lola Le Breton · Quentin Fournier · John X. Morris · Mariam El Mezouar · Sarath Chandar
Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT$_{large}$, RoBERTa$_{large}$, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.
Amortized Inference of Causal Models via Conditional Fixed-Point Iterations
Divyat Mahajan · Jannes Gladrow · Agrin Hilmkil · Cheng Zhang · Meyer Scetbon
Structural Causal Models (SCMs) offer a principled framework to reason about interventions and support out-of-distribution generalization, which are key goals in scientific discovery. However, the task of learning SCMs from observed data poses formidable challenges, and often requires training a separate model for each dataset. In this work, we propose an amortized inference framework that trains a single model to predict the causal mechanisms of SCMs conditioned on their observational data and causal graph. We first use a transformer-based architecture for amortized learning of dataset embeddings, and then extend the Fixed-Point Approach (FiP) to infer the causal mechanisms conditionally on their dataset embeddings. As a byproduct, our method can generate observational and interventional data from novel SCMs at inference time, without updating parameters. Empirical results show that our amortized procedure performs on par with baselines trained specifically for each dataset on both in- and out-of-distribution problems, and also outperforms them in scarce data regimes.
Faster Diffusion Through Temporal Attention Decomposition
Haozhe Liu · Wentian Zhang · Jinheng Xie · Francesco Faccio · Mengmeng Xu · Tao Xiang · Mike Zheng Shou · Juan-Manuel Perez-Rua · Jürgen Schmidhuber
We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. Self-attention, however, initially plays a minor role but becomes increasingly important in the second phase. These findings yield a simple and training-free method called TGATE which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experiments show TGATE’s broad applicability to various existing text-conditional diffusion models which it speeds up by 10-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
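The caching mechanism the abstract describes can be illustrated with a minimal sketch. The function names, the toy attention stand-in, and the update rule below are illustrative assumptions, not the paper's implementation; the point is the control flow of computing cross-attention during the early "planning" phase and reusing a cached output afterwards.

```python
import numpy as np

def cross_attention(latent, text_emb):
    # Stand-in for a real cross-attention layer: any function of the
    # latent and the text embedding suffices for this sketch.
    return latent * 0.5 + text_emb.mean() * 0.5

def denoise_with_tgate(latent, text_emb, num_steps=50, gate_step=10):
    """TGATE-style caching sketch: compute cross-attention normally until
    `gate_step`, then reuse the cached output for the remaining steps."""
    cached = None
    for t in range(num_steps):
        if t < gate_step:
            attn_out = cross_attention(latent, text_emb)
            cached = attn_out            # last "planning phase" output
        else:
            attn_out = cached            # fidelity phase: reuse, skip recompute
        latent = 0.9 * latent + 0.1 * attn_out   # toy denoising update
    return latent
```

After the gate step, the text embedding is never touched again, which is where the reported 10-50% speedup would come from in a real model.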
Temporal Test-Time Adaptation with State-Space Models
Mona Schirmer · Dan Zhang · Eric Nalisnick
Distribution shifts between training and test data are inevitable over the lifecycle of a deployed model, leading to performance decay. Adapting a model on test samples can help mitigate this drop in performance. However, most test-time adaptation methods have focused on synthetic corruption shifts, leaving a variety of distribution shifts underexplored. In this paper, we focus on distribution shifts that evolve gradually over time, which are common in the wild but challenging for existing methods, as we show. To address this, we propose STAD, a Bayesian filtering method that adapts a deployed model to temporal distribution shifts by learning the time-varying dynamics in the last set of hidden features. Without requiring labels, our model infers time-evolving class prototypes that act as a dynamic classification head. Through experiments on real-world temporal distribution shifts, we show that our method excels in handling small batch sizes and label shift.
Generalized Compressed Sensing for Image Reconstruction with Diffusion Probabilistic Models
Ling-Qi Zhang · Zahra Kadkhodaie · Eero P Simoncelli · David H Brainard
We examine the problem of selecting a small set of linear measurements for reconstructing high-dimensional signals. Well-established methods for optimizing such measurements include principal component analysis (PCA), independent component analysis (ICA) and compressed sensing (CS) based on random projections, all of which rely on axis- or subspace-aligned statistical characterization of the signal source. However, many naturally occurring signals, including photographic images, contain richer statistical structure. To exploit such structure, we introduce a general method for obtaining an optimized set of linear measurements for efficient image reconstruction, where the signal statistics are expressed by the prior implicit in a neural network trained to perform denoising (known as a ``diffusion model''). We demonstrate that the optimal measurements derived for two natural image datasets differ from those of PCA, ICA, or CS, and result in substantially lower mean squared reconstruction error. Interestingly, the marginal distributions of the measurement values are asymmetrical (skewed), substantially more so than those of previous methods. We also find that optimizing with respect to perceptual loss, as quantified by structural similarity (SSIM), leads to measurements different from those obtained when optimizing for MSE. Our results highlight the importance of incorporating the specific statistical regularities of natural signals when designing effective linear measurements.
What is the Relationship between Tensor Factorizations and Circuits (and How Can We Exploit it)?
Lorenzo Loconte · Antonio Mari · Gennaro Gala · Robert Peharz · Cassio de Campos · Erik Quaeghebeur · Gennaro Vessio · Antonio Vergari
This paper establishes a rigorous connection between circuit representations and tensor factorizations, two seemingly distinct yet fundamentally related areas. By connecting these fields, we highlight a series of opportunities that can benefit both communities. Our work generalizes popular tensor factorizations within the circuit language, and unifies various circuit learning algorithms under a single, generalized hierarchical factorization framework. Specifically, we introduce a modular “Lego block” approach to build tensorized circuit architectures. This, in turn, allows us to systematically construct and explore various circuit and tensor factorization models while maintaining tractability. This connection not only clarifies similarities and differences in existing models, but also enables the development of a comprehensive pipeline for building and optimizing new circuit/tensor factorization architectures. We show the effectiveness of our framework through extensive empirical evaluations, and highlight new research opportunities for tensor factorizations in probabilistic modeling.
SCas4D: Structural Cascaded Optimization for Boosting Persistent 4D Novel View Synthesis
Jipeng Lyu · Jiahua Dong · Yu-Xiong Wang
Persistent dynamic scene modeling for tracking and novel-view synthesis remains challenging, particularly due to the complexity of capturing accurate deformations while maintaining computational efficiency. In this paper, we present SCas4D, a novel cascaded optimization framework that leverages inherent structural patterns in 3D Gaussian Splatting (3DGS) for dynamic scenes. Our key insight is that real-world deformations often exhibit hierarchical patterns, where groups of Gaussians undergo similar transformations. By employing a structural cascaded optimization approach that progressively refines deformations from coarse part-level to fine point-level adjustments, SCas4D achieves convergence within 100 iterations per time frame while maintaining quality competitive with the state-of-the-art method using only 1/20th of the training iterations. We further demonstrate our method's effectiveness in self-supervised articulated object segmentation, a capability that emerges naturally from our representation. Extensive experiments demonstrate our method's effectiveness in novel view synthesis and dense point tracking tasks. Please find our project page at https://github-tree-0.github.io/SCas4D-project-page/.
A Case for Library-Level k-Means Binning in Histogram Gradient-Boosted Trees
Asher Labovich
Modern Gradient Boosted Decision Trees (GBDTs) accelerate split finding with histogram-based binning, which reduces complexity from $O(N\log N)$ to $O(N)$ by aggregating gradients into fixed-size bins. However, the predominant quantile binning strategy—designed to distribute data points evenly among bins—may overlook critical boundary values that could enhance predictive performance. In this work, we consider a novel approach that replaces quantile binning with a $k$-means discretizer initialized with quantile bins, and justify the swap with a proof showing how, for any $L$-Lipschitz function, $k$-means maximizes the worst-case explained variance of $Y$ obtained when treating all values in a given bin as equivalent. We test this swap against quantile and uniform binning on 33 OpenML datasets plus synthetics that control for modality, skew, and bin budget. Across 18 regression datasets, $k$-means shows no statistically significant losses at the 5% level and wins in three cases—most strikingly a 55% MSE drop on one particularly skewed dataset—even though $k$-means' mean reciprocal rank (MRR) is slightly lower (0.65 vs 0.72). On the 15 classification datasets the two methods are statistically tied (MRR 0.70 vs 0.68) with gaps $\leq$0.2 pp. Synthetic experiments confirm consistently large MSE gains—typically $>$20% and rising to 90% as outlier magnitude increases or bin budget drops. We find that $k$-means keeps error on par with exhaustive (no-binning) splitting when extra cuts add little value, yet still recovers key split points that quantile overlooks. As such, we advocate for a built-in bin_method=$k$-means flag, especially in regression tasks and in tight-budget settings such as the 32–64-bin GPU regime—because it is a "safe default" with large upside, yet adds only a one-off, cacheable overhead ($\approx$ 3.5s per feature to bin 10M rows on one Apple M1 thread).
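The proposed binner can be sketched in a few lines of NumPy: run 1-D Lloyd's $k$-means on the feature values, initialized at quantile-bin centers as the abstract describes, and place bin edges midway between adjacent centroids. The function name and details are illustrative, not library code.

```python
import numpy as np

def kmeans_bin_edges(x, n_bins=8, n_iter=20):
    """Sketch of k-means binning for histogram GBDTs: 1-D k-means on
    feature values, initialized from quantile-bin centers."""
    # Quantile initialization: one centroid per quantile bin.
    qs = np.linspace(0, 1, n_bins + 1)
    centers = np.quantile(x, (qs[:-1] + qs[1:]) / 2)
    for _ in range(n_iter):                      # Lloyd's algorithm in 1-D
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for k in range(n_bins):
            pts = x[assign == k]
            if len(pts):
                centers[k] = pts.mean()
    centers = np.sort(centers)
    edges = (centers[:-1] + centers[1:]) / 2     # split halfway between centroids
    return edges                                 # use with np.digitize(x, edges)
```

Unlike quantile edges, these edges migrate toward dense clusters and isolated outlier groups, which is the mechanism behind the reported gains on skewed data.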
An Expanded Benchmark that Rediscovers and Affirms the Edge of Uncertainty Sampling for Active Learning in Tabular Datasets
Po-Yi Lu · Yi-Jie Cheng · Chun-Liang Li · Hsuan-Tien Lin
Active Learning (AL) addresses the crucial challenge of enabling machines to efficiently gather labeled examples through strategic queries. Among the many AL strategies, Uncertainty Sampling (US) stands out as one of the most widely adopted. US queries the example(s) that the current model finds uncertain, proving to be both straightforward and effective. Despite claims in the literature suggesting superior alternatives to US, community-wide acceptance remains elusive. In fact, existing benchmarks for tabular datasets present conflicting conclusions on the continued competitiveness of US. In this study, we review the literature on AL strategies in the last decade and build the most comprehensive open-source AL benchmark to date to understand the relative merits of different AL strategies. The benchmark surpasses existing ones by encompassing a broader coverage of strategies, models, and data. Through our investigation of the conflicting conclusions in existing tabular AL benchmarks by evaluation under broad AL experimental settings, we uncover fresh insights into the often-overlooked issue of **model compatibility** in the context of US. Specifically, we notice that using different models for querying unlabeled examples and for the learning task degrades US's effectiveness. Notably, our findings affirm that US maintains a competitive edge over other strategies when paired with compatible models. These findings have practical implications and provide a concrete recipe for AL practitioners, empowering them to make informed decisions when working with tabular classification with limited labeled data. The code for this project is available at https://github.com/ariapoy/active-learning-benchmark.
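The query rule at the heart of Uncertainty Sampling is short enough to state directly; the sketch below uses least-confidence scoring, one common variant. Consistent with the model-compatibility finding, `proba` should come from the same model used for the learning task; the function name is illustrative.

```python
import numpy as np

def uncertainty_sample(proba, batch_size=1):
    """Least-confidence Uncertainty Sampling sketch: query the unlabeled
    examples whose top predicted class probability is smallest.
    `proba` is an (n_unlabeled, n_classes) array of predicted probabilities."""
    confidence = proba.max(axis=1)              # confidence of top class
    return np.argsort(confidence)[:batch_size]  # least confident first
```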
Slicing the Gaussian Mixture Wasserstein Distance
Moritz Piening · Robert Beinert
Gaussian mixture models (GMMs) are widely used in machine learning for tasks such as clustering, classification, image reconstruction, and generative modeling. A key challenge in working with GMMs is defining a computationally efficient and geometrically meaningful metric. The mixture Wasserstein (MW) distance adapts the Wasserstein metric to GMMs and has been applied in various domains, including domain adaptation, dataset comparison, and reinforcement learning. However, its high computational cost—arising from repeated Wasserstein distance computations involving matrix square root estimations and an expensive linear program—limits its scalability to high-dimensional and large-scale problems. To address this, we propose multiple novel slicing-based approximations to the MW distance that significantly reduce computational complexity while preserving key optimal transport properties. From a theoretical viewpoint, we establish several weak and strong equivalences between the introduced metrics, and show the relations to the original MW distance and the well-established sliced Wasserstein distance. Furthermore, we validate the effectiveness of our approach through numerical experiments, demonstrating computational efficiency and applications in clustering, perceptual image comparison, and GMM minimization.
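The slicing mechanism can be illustrated at the sample level: project both distributions onto random directions and average the closed-form 1-D Wasserstein distance, computed by sorting. Note this is the generic sliced Wasserstein idea applied to samples; the paper's metrics slice the GMM parameters directly, which this sketch does not reproduce.

```python
import numpy as np

def sliced_w2_samples(X, Y, n_proj=64, rng=None):
    """Sample-based sliced W2 sketch: average the 1-D squared Wasserstein
    distance (sorted matching) over random unit projections.
    X and Y must contain the same number of samples."""
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    thetas = rng.normal(size=(n_proj, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    total = 0.0
    for theta in thetas:
        px, py = np.sort(X @ theta), np.sort(Y @ theta)
        total += np.mean((px - py) ** 2)         # closed-form 1-D W2^2
    return np.sqrt(total / n_proj)
```

Each projection costs only a sort, which is why slicing avoids the matrix square roots and linear programs that make the full MW distance expensive.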
Celo: Training Versatile Learned Optimizers on a Compute Diet
Abhinav Moudgil · Boris Knyazev · Guillaume Lajoie · Eugene Belilovsky
Learned optimization has emerged as a promising alternative to hand-crafted optimizers, with the potential to discover stronger learned update rules that enable faster, hyperparameter-free training of neural networks. A critical element for practically useful learned optimizers that can be used off-the-shelf after meta-training is strong meta-generalization: the ability to apply the optimizers to new tasks. Recent state-of-the-art work in learned optimizers, VeLO (Metz et al., 2022), requires a large number of highly diverse meta-training tasks along with massive computational resources (4,000 TPU months) to achieve meta-generalization. This makes further improvements to such learned optimizers impractical. In this work, we identify several key elements in learned optimizer architectures and meta-training procedures that can lead to strong meta-generalization. We also propose evaluation metrics to reliably assess quantitative performance of an optimizer at scale on a set of evaluation tasks. Our proposed approach, Celo, makes a significant leap in improving the meta-generalization performance of learned optimizers and also outperforms tuned state-of-the-art optimizers on a diverse set of out-of-distribution tasks, despite being meta-trained for just 24 GPU hours.
FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness
Vincent Abbott · Gioele Zardini
Optimizing deep learning algorithms currently requires slow, manual derivation, potentially leaving much performance untapped. Methods like FlashAttention have achieved a 6× performance improvement over native PyTorch by avoiding unnecessary data transfers, but required three iterations over three years to be developed. Automated compiled methods have consistently lagged behind. This paper extends Neural Circuit Diagrams for deep learning models to consider resource usage and the distribution of tasks across a GPU hierarchy. We show how diagrams can use simple relabellings to derive high-level streaming and tiling optimization strategies along with performance models. We show how this high-level performance model allows the effects of quantization and multi-level GPU hierarchies to be readily considered. We develop a methodology for representing intermediate-level pseudocode with diagrams, allowing hardware-aware algorithms to be derived step-by-step. Finally, we show how our methodology can be used to better understand existing techniques like FlashAttention. This work uses a theoretical framework to link assumptions about GPU behaviour to claims about performance. We aim to lay the groundwork for a scientific approach to GPU optimization where experiments can address clear hypotheses rather than post-hoc rationalizations.
Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler
Aleksandr Dremov · Alexander Hägele · Atli Kosson · Martin Jaggi
Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in getting the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis shows that different cooldown shapes expose a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations — comparable to those from cooldown shape selection — when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a loss landscape perspective, we provide visualizations of the landscape during cooldown, supporting the river valley loss perspective empirically. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.
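The WSD schedule being analyzed has a simple shape. The sketch below implements one common variant with a linear cooldown; the function name and default fractions are illustrative, and the cooldown shape is exactly the design choice the paper studies.

```python
def wsd_lr(step, total_steps, warmup=0.1, cooldown=0.2, peak_lr=1e-3):
    """Warmup-Stable-Decay sketch: linear warmup to peak_lr, constant
    plateau, then a linear cooldown to zero over the final fraction."""
    warm_end = int(warmup * total_steps)
    cool_start = int((1 - cooldown) * total_steps)
    if step < warm_end:                        # warmup: 0 -> peak
        return peak_lr * step / max(warm_end, 1)
    if step < cool_start:                      # stable plateau
        return peak_lr
    frac = (step - cool_start) / max(total_steps - cool_start, 1)
    return peak_lr * (1 - frac)                # linear decay to 0
```

Swapping the final line for, e.g., a cosine or square-root decay yields the alternative cooldown shapes whose bias-variance behavior the paper compares.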
MobileCLIP2: Improving Multi-Modal Reinforced Training
Fartash Faghri · Pavan Kumar Anasosalu Vasu · Cem Koc · Vaishaal Shankar · Alexander T Toshev · Oncel Tuzel · Hadi Pouransari
Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2× smaller and improves on DFN ViT-L/14 at 2.5× lower latency. We release our pretrained models and the data generation code. The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
CNN Interpretability with Multivector Tucker Saliency Maps for Self-Supervised Models
Aymene Mohammed Bouayed · Samuel Deslauriers-gauthier · Adrian IACOVELLI · David Naccache
Interpreting the decisions of Convolutional Neural Networks (CNNs) is essential for understanding their behavior, yet it remains a significant challenge, particularly for self-supervised models. Most existing methods for generating saliency maps rely on reference labels, restricting their use to supervised tasks. EigenCAM is the only notable label-independent alternative, leveraging Singular Value Decomposition to generate saliency maps applicable across CNN models, but it does not fully exploit the tensorial structure of feature maps. In this work, we introduce the Tucker Saliency Map (TSM) method, which applies Tucker tensor decomposition to better capture the inherent structure of feature maps, producing more accurate singular vectors and values. These are used to generate high-fidelity saliency maps, effectively highlighting objects of interest in the input. We further extend EigenCAM and TSM into multivector variants—Multivec-EigenCAM and Multivector Tucker Saliency Maps (MTSM)—which utilize all singular vectors and values, further improving saliency map quality. Quantitative evaluations on supervised classification models demonstrate that TSM, Multivec-EigenCAM, and MTSM achieve competitive performance with label-dependent methods. Moreover, TSM enhances interpretability by approximately $50\%$ over EigenCAM for both supervised and self-supervised models. Multivec-EigenCAM and MTSM further advance state-of-the-art interpretability performance on self-supervised models, with MTSM achieving the best results.
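The baseline being extended is compact enough to sketch. Below is a minimal, label-free EigenCAM-style saliency map from a single convolutional feature map via SVD of its channel-by-spatial unfolding; TSM's contribution is to replace this matrix SVD with a Tucker decomposition of the full tensor, which this sketch does not include. Names are illustrative.

```python
import numpy as np

def eigencam_saliency(feature_map):
    """Label-free saliency sketch in the spirit of EigenCAM: take the
    leading spatial singular vector of the (C, H*W) unfolded feature map."""
    C, H, W = feature_map.shape
    mat = feature_map.reshape(C, H * W)          # channel x spatial unfolding
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    saliency = np.abs(vt[0].reshape(H, W))       # leading spatial component
    return (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
```

Because no class label enters the computation, the same map applies to self-supervised models, which is the setting the multivector variants target.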
Chimera: State Space Models Beyond Sequences
Aakash Lahoti · Tanya Marwah · Ratish Puduppully · Albert Gu
Transformer-based deep learning methods have emerged as the standard approach to model diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires the use of inductive biases, such as position embeddings in sequences and images, and random walks in graphs, to incorporate topology. However, developing bespoke inductive biases for each task requires significant effort and can also introduce side-effects hindering generalization. In this work, we introduce Chimera, a unified model that directly incorporates the data topology in a principled way, obviating the need for domain-specific biases. Central to Chimera is the observation that state-space models---which naturally do not require position embeddings---can be generalized to capture any general graph topology. Our experiments demonstrate the versatility of our approach---Chimera achieves strong performance across the domains of language, vision, and graphs, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all the baselines on the Long Range Graph Benchmark. Our results validate Chimera's principled methodological contributions and affirm the long-held belief that data topology is a powerful inductive bias across modalities. We further propose algorithmic optimizations to improve Chimera's efficiency while maintaining performance: 1) For the subclass of Directed Acyclic Graphs we show that Chimera can be implemented as a linear time recurrence. 2) For general graphs, we relax the method with a simple mathematical approximation, achieving Transformer's quadratic complexity without relying on domain-specific biases.
Setting the Record Straight on Transformer Oversmoothing
Gbètondji J-S Dovonon · Michael Bronstein · Matt J. Kusner
Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown empirically and theoretically that Transformers are inherently limited. Specifically, they argue that as model depth increases, Transformers oversmooth, i.e., inputs become more and more similar. A natural question is: How can Transformers achieve these successes given this shortcoming? In this work we test these observations empirically and theoretically and uncover a number of surprising findings. We find that there are cases where feature similarity increases but, contrary to prior results, this is not inevitable, even for existing pre-trained models. Theoretically, we show that smoothing behavior depends on the eigenspectrum of the value and projection weights. We verify this empirically and observe that the sign of layer normalization weights can influence this effect. Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior. We hope that our findings give ML researchers and practitioners additional insight into how to develop future Transformer-based models.
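The quantity at issue, feature similarity across tokens, is easy to probe. A minimal sketch (the function name and epsilon are illustrative): compute the mean pairwise cosine similarity among token representations at one layer, and track it across depth; oversmoothing would appear as this value drifting toward 1.

```python
import numpy as np

def mean_pairwise_cosine(tokens):
    """Oversmoothing probe sketch: mean off-diagonal cosine similarity
    among token representations (tokens: (n_tokens, d) array)."""
    X = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = X @ X.T                              # cosine similarity matrix
    n = len(X)
    off_diag = sim[~np.eye(n, dtype=bool)]     # exclude self-similarity
    return off_diag.mean()
```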
Encoder-only Next Token Prediction
Ethan Ewer · Daewon Chae · Thomas Zeng · Jinkyu Kim · Kangwook Lee
Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. But if we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $\operatorname{Count3}$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token prediction based Transformers can be evaluated, including addition, in-context learning, and language modeling.
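The inference pattern that distinguishes ENTP can be sketched with a single toy attention layer (the function names and the single-layer simplification are illustrative): to predict token $t{+}1$, the full prefix is re-encoded with unmasked attention, so earlier positions can attend to later prefix tokens, whereas a decoder-only model computes one causally masked pass and caches keys and values.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entp_layer_outputs(X):
    """ENTP inference sketch: for each prefix length t, re-run unmasked
    attention over x_1..x_t and keep the last position's representation,
    which would be used to predict x_{t+1}. X: (T, d) token embeddings."""
    T, d = X.shape
    outs = []
    for t in range(1, T + 1):
        P = X[:t]                           # prefix up to position t
        A = softmax(P @ P.T / np.sqrt(d))   # full attention, no causal mask
        outs.append((A @ P)[-1])
    return np.stack(outs)
```

The loop over prefix lengths is exactly the extra compute that key/value caching avoids in decoder-only models, which is why the abstract frames ENTP as attractive only when compute is unbounded.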
DNOD: Deformable Neural Operators for Object Detection in SAR Images
GVS Mothish · J Rishi · Shobhit Kumar Shukla · Deepak Subramani
We introduce a deep neural operator framework aimed at object detection in remotely sensed Synthetic Aperture Radar (SAR) images. Recent research highlights the impressive performance of the End-to-End Object Detection Transformer (DETR). Nonetheless, in domains like SAR imaging, managing challenges such as speckle noise and the detection of small objects continues to be problematic. To address SAR object detection issues, we present the Deformable Neural Operator-Based Object Detection (DNOD) framework, tailored for SAR tasks. We develop two neural operators: Multi-Scale Fourier Mixing (MSFM) for the encoder and Multi-scale, multi-input Adaptive Deformable Fourier Neural Operator (MADFNO) for the decoder. Detailed evaluations and ablation studies show that DNOD exceeds existing methods, delivering significantly better results with an improvement of +2.23 mAP on the SARDet-100k dataset, the largest SAR object detection compilation. The code is available at https://github.com/quest-lab-iisc/DNOD.
Tabby: A Language Model Architecture for Tabular and Structured Data Synthesis
Sonia Cromp · Satya Sai Srinath Namburi GNVV · Mohammed Alkhudhayri · Catherine Cao · Samuel Guo · Nicholas Roberts · Frederic Sala
Large language models (LLMs) have greatly improved the quality of synthetic text data. We aim to extend these advances to tabular data with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby represents differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. Pairing Tabby with Plain, our novel tabular training technique, we observe up to a $7\%$ improvement in quality (measured by MLE) over previous methods. Additionally, our approach is more flexible than prior strategies and extends beyond tables, to more general structured data. In a structured JSON setting, Tabby outperforms all other methods by $2$-$3$ points and is the only approach with MLE equal to the upper bound of non-synthetic data.
HalluEntity: Benchmarking and Understanding Entity-Level Hallucination Detection
Min-Hsuan Yeh · Max Kamachee · Seongheon Park · Yixuan Li
To mitigate the impact of LLM hallucinations, many studies propose detecting hallucinated generations through uncertainty estimation. However, these approaches predominantly operate at the sentence or paragraph level, failing to pinpoint specific spans or entities responsible for hallucinated content. This lack of granularity is especially problematic for long-form outputs that mix accurate and fabricated information. To address this limitation, we explore entity-level hallucination detection. We propose a new dataset, HalluEntity, which annotates hallucinations at the entity level. Based on the dataset, we comprehensively evaluate uncertainty-based hallucination detection approaches across 17 modern LLMs. Our experimental results show that uncertainty estimation approaches focusing on individual token probabilities tend to over-predict hallucinations, while context-aware methods show better but still suboptimal performance. Through an in-depth qualitative study, we identify relationships between hallucination tendencies and linguistic properties and highlight important directions for future research.
HalluEntity: https://huggingface.co/datasets/samuelyeh/HalluEntity
Adversarial Robustness of Graph Transformers
Philipp Foth · Lukas Gosch · Simon Geisler · Leo Schwinn · Stephan Günnemann
Existing studies have shown that Message-Passing Graph Neural Networks (MPNNs) are highly susceptible to adversarial attacks. In contrast, despite the increasing importance of Graph Transformers (GTs), their robustness properties are unexplored. We close this gap and design the first adaptive attacks for GTs. In particular, we provide general design principles for strong gradient-based attacks on GTs w.r.t. structure perturbations and instantiate our attack framework for five representative and popular GT architectures. Specifically, we study GTs with specialized attention mechanisms and Positional Encodings (PEs) based on pairwise shortest paths, random walks, and the Laplacian spectrum. We evaluate our attacks on multiple tasks and perturbation models, including structure perturbations for node and graph classification, and node injection for graph classification. Our results reveal that GTs can be catastrophically fragile in many cases. Addressing this vulnerability, we show how our adaptive attacks can be effectively used for adversarial training, substantially improving robustness.
Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching
Junn Yong Loo · Fang Yu Leong · Michelle Adeline · Julia Kaiwen Lau · Hwa Hui Tew · Arghya Pal · Vishnu Monn Baskaran · Chee-Ming Ting · Raphaël C.-W. Phan
Energy-based models (EBMs) are a powerful class of probabilistic generative models due to their flexibility and interpretability. However, relationships between potential flows and explicit EBMs remain underexplored, while contrastive divergence training via implicit Markov chain Monte Carlo (MCMC) sampling is often unstable and expensive in high-dimensional settings. In this paper, we propose Variational Potential (VAPO) Flow Bayes, a new energy-based generative framework that eliminates the need for implicit MCMC sampling and does not rely on auxiliary networks or cooperative training. VAPO learns an energy-parameterized potential flow by constructing a flow-driven density homotopy that is matched to the data distribution through a variational loss minimizing the Kullback-Leibler divergence between the flow-driven and marginal homotopies. This principled formulation enables robust and efficient generative modeling while preserving the interpretability of EBMs. Experimental results on image generation, interpolation, out-of-distribution detection, and compositional generation confirm the effectiveness of VAPO, showing that our method performs competitively with existing approaches in terms of sample quality and versatility across diverse generative modeling tasks.
Assessing Robustness via Score-Based Adversarial Image Generation
Marcel Kollovieh · Lukas Gosch · Marten Lienen · Yan Scholten · Leo Schwinn · Stephan Günnemann
Most adversarial attacks and defenses focus on perturbations within small $\ell_p$-norm constraints. However, $\ell_p$ threat models cannot capture all relevant semantics-preserving perturbations, and hence, the scope of robustness evaluations is limited. In this work, we introduce Score-Based Adversarial Generation (ScoreAG), a novel framework that leverages the advancements in score-based generative models to generate unrestricted adversarial examples that overcome the limitations of $\ell_p$-norm constraints. Unlike traditional methods, ScoreAG maintains the core semantics of images while generating adversarial examples, either by transforming existing images or synthesizing new ones entirely from scratch. We further exploit the generative capability of ScoreAG to purify images, empirically enhancing the robustness of classifiers. Our extensive empirical evaluation demonstrates that ScoreAG improves upon the majority of state-of-the-art attacks and defenses across multiple benchmarks. This work highlights the importance of investigating adversarial examples bounded by semantics rather than $\ell_p$-norm constraints. ScoreAG represents an important step towards more encompassing robustness assessments.
Inverse Scaling in Test-Time Compute
Aryo Pradipta Gema · Alexander Hägele · Runjin Chen · Andy Arditi · Jacob Goldman-Wetzler · Kit Fraser-Taliente · Henry Sleight · Linda Petrini · Julian Michael · Beatrice Alex · Pasquale Minervini · Yanda Chen · Joe Benton · Ethan Perez
We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.
ODNet: Opinion Dynamics-Inspired Neural Message Passing for Graphs and Hypergraphs
Bingxin Zhou · Outongyi Lv · Jing Wang · Xiang Xiao · Weishu Zhao
Neural message passing serves as a cornerstone framework in graph neural networks, providing a clear and intuitive mathematical guideline for the propagation and aggregation of information among interconnected nodes within graphs. Throughout this process, node representations undergo dynamic updates, considering both the individual states and connections of neighboring nodes. Concurrently, social networks, as prominent forms of interconnected data, form dynamic systems that achieve stability through continuous internal communications and opinion exchanges among social actors along their social ties. Drawing upon the shared concepts between these two domains, our study establishes an explicit connection between message passing and opinion dynamics in sociology. Moreover, we introduce a novel continuous message passing scheme termed ODNet, which integrates bounded confidence to refine the influence weight of local nodes for message propagation. By adjusting the similarity cutoffs of bounded confidence and influence weights within ODNet, we define opinion exchange rules that align with the characteristics of neural message passing and can effectively mitigate the oversmoothing issue. We extend the framework to hypergraphs and formulate corresponding continuous message passing rules, which reveal a close association with particle dynamics. Empirically, we showcase that ODNet enhances prediction performance across various social networks presented as homophilic graphs, heterophilic graphs, and hypergraphs. Notably, our proposed ODNet outperforms existing GNNs with its straightforward construction and robust theoretical foundation.
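The bounded-confidence mechanism ODNet borrows from sociology can be illustrated with the classical Hegselmann-Krause opinion-dynamics update, sketched below; this is the underlying sociological model, not ODNet's neural message-passing layer itself:

```python
import numpy as np

def bounded_confidence_step(x, eps=0.5):
    """One Hegselmann-Krause update: each agent adopts the mean opinion
    of all agents within its confidence radius eps (a similarity cutoff)."""
    x = np.asarray(x, dtype=float)
    new_x = np.empty_like(x)
    for i in range(len(x)):
        neighbors = x[np.abs(x - x[i]) <= eps]  # bounded-confidence set
        new_x[i] = neighbors.mean()
    return new_x

# The three close opinions average toward 0.1; the distant opinion at 1.0
# is ignored, so the system does not collapse to a single global consensus.
print(bounded_confidence_step([0.0, 0.1, 0.2, 1.0]))
```

The similarity cutoff is what prevents all states from averaging together, which is the intuition behind using bounded confidence to mitigate oversmoothing in message passing.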
Adaptive Mesh Quantization for Neural PDE Solvers
Winfried van den Dool · Maksim Zhdanov · Yuki M. Asano · Max Welling
Physical systems commonly exhibit spatially varying complexity, presenting a significant challenge for neural PDE solvers. While Graph Neural Networks can handle the irregular meshes required for complex geometries and boundary conditions, they still apply uniform computational effort across all nodes regardless of the underlying physics complexity. This leads to inefficient resource allocation where computationally simple regions receive the same treatment as complex phenomena. We address this challenge by introducing Adaptive Mesh Quantization: spatially adaptive quantization across mesh node, edge and cluster features, dynamically adjusting the bit-width used by a quantized model.
We propose an adaptive bit-width allocation strategy driven by a lightweight auxiliary model that identifies high-loss regions in the input mesh. This enables dynamic resource distribution in the main model, where regions of higher difficulty are allocated increased bit-width, optimizing computational resource utilization. We demonstrate our framework's effectiveness by integrating it with two state-of-the-art models, MP-PDE and GraphViT, to evaluate performance across multiple tasks: 2D Darcy flow, large-scale unsteady fluid dynamics in 2D, steady-state Navier–Stokes simulations in 3D, and a 2D hyper-elasticity problem. Our framework demonstrates consistent Pareto improvements over uniformly quantized baselines, yielding up to 50\% improvements in performance at the same cost.
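A loss-driven bit-width allocation of the kind described above can be sketched as follows; the bit levels, fractions, and quantile-based rule are hypothetical choices for illustration, not the paper's trained allocation policy:

```python
import numpy as np

def allocate_bitwidths(predicted_loss, bit_levels=(4, 8, 16),
                       fractions=(0.5, 0.3, 0.2)):
    """Assign a bit-width to each mesh node from its predicted loss:
    the easiest 50% of nodes get 4 bits, the next 30% get 8, and the
    hardest 20% get 16. All numbers here are illustrative assumptions."""
    predicted_loss = np.asarray(predicted_loss, dtype=float)
    order = np.argsort(predicted_loss)  # ascending predicted difficulty
    n = len(predicted_loss)
    bits = np.empty(n, dtype=int)
    start = 0
    for b, f in zip(bit_levels, fractions):
        end = start + int(round(f * n))
        bits[order[start:end]] = b
        start = end
    bits[order[start:]] = bit_levels[-1]  # remainder from rounding, if any
    return bits

loss = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5, 1.0])
print(allocate_bitwidths(loss))
```

The auxiliary model's predicted losses stand in for `predicted_loss` here; in the paper's framework the allocation feeds a quantized main model rather than being an end in itself.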
An Information-Theoretic Lower Bound on the Generalization Error of Autoencoders
Shyam Venkatasubramanian · Sean Moushegian · Ahmed Aloui · Vahid Tarokh
Quantifying the limitations of classical neural network architectures is a critically underexplored area of machine learning research. Deriving lower bounds on the optimal performance of these architectures can facilitate improved neural architecture search and overfitting detection. We present an information-theoretic lower bound on the generalization mean squared error of autoencoders with sigmoid activation functions. Through the Estimation Error and Differential Entropy (EEDE) inequality for continuous random vectors, we derive this lower bound, which provides a new perspective on the inherent limitations and capabilities of autoencoders. Our analysis extends to the examination of how this lower bound is influenced by various architectural features and data distribution characteristics. This study enriches our theoretical understanding of autoencoders and has substantial practical implications for their design, optimization, and application in the field of deep learning.
On the Ability of Deep Networks to Learn Symmetries from Data – A Neural Kernel Theory
Andrea Perin · Stephane Deny
Symmetries (transformations by group actions) are present in many datasets, and leveraging them holds considerable promise for improving predictions in machine learning. In this work, we aim to understand when and how deep networks—with standard architectures trained in a standard, supervised way—learn symmetries from data. Inspired by real-world scenarios, we study a classification paradigm where data symmetries are only partially observed during training: some classes include all transformations of a cyclic group, while others—only a subset. We ask: under which conditions will deep networks correctly classify the partially sampled classes? In the infinite-width limit, where neural networks behave like kernel machines, we derive a neural kernel theory of symmetry learning. The group-cyclic nature of the dataset allows us to analyze the Gram matrix of neural kernels in the Fourier domain; here we find a simple characterization of the generalization error as a function of class separation (signal) and class-orbit density (noise). This characterization reveals that generalization can only be successful when the local structure of the data prevails over its non-local, symmetry-induced structure, in the kernel space defined by the architecture. This occurs when (1) classes are sufficiently distinct and (2) class orbits are sufficiently dense. We extend our theoretical treatment to any finite group, including non-abelian groups. Our framework also applies to equivariant architectures (e.g., CNNs), and recovers their success in the special case where the architecture matches the inherent symmetry of the data. Empirically, our theory reproduces the generalization failure of finite-width networks (MLP, CNN, ViT) trained on partially observed versions of rotated-MNIST. We conclude that conventional deep networks lack a mechanism to learn symmetries that have not been explicitly embedded in their architecture a priori.
In the future, our framework could be extended to guide the design of architectures and training procedures able to learn symmetries from data. All code is available at https://github.com/Andrea-Perin/gpsymm.
ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
Yaswanth Narsupalli · Abhranil Chandra · Sreevatsa Muppirala · Manish Gupta · Pawan Goyal
Assessing the quality of generative model outputs from large language models (LLMs) or vision-language models (VLMs) poses significant challenges. Traditional evaluation methods either rely on human assessment, which is resource-intensive and not scalable, or on automatic metrics that often correlate poorly with human preferences. Another approach is to train dedicated neural evaluators, but this typically requires substantial training data and compute. In this study, we thus introduce ReFeR, a tuning-free framework for evaluating generative outputs including both text and images, using a two-level hierarchy of pre-trained LLM and VLM evaluators. This multi-agent hierarchical strategy leverages additional compute at inference time by orchestrating multiple models and utilizing the increased test-time reasoning to boost performance. By having models themselves provide feedback and final judgments, ReFeR reduces the dependence on human evaluation. We rigorously evaluate ReFeR on four diverse evaluation benchmarks, where it surpasses prior methods in accuracy while also generating constructive feedback useful for downstream distillation and self-improvement via finetuning. Interestingly, ReFeR is also applicable to reasoning tasks: experiments on four reasoning benchmarks show ReFeR’s superior collective reasoning abilities. We present two variants of the framework: ReFeR-Turbo, optimized for accelerated performance, and ReFeR-Lite, offering a more test-time compute efficient solution. ReFeR-Lite is $\sim12-14\times$ more compute efficient than previous works while being comparably accurate to ReFeR-Turbo.
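As a toy illustration of a two-level judge hierarchy, the sketch below has peer judges score an output and a meta judge issue the final verdict after seeing their scores; the judge interfaces and the mean-based meta rule are assumptions for illustration, not ReFeR's actual prompting protocol:

```python
from statistics import mean

def hierarchical_evaluate(output, peer_judges, meta_judge):
    """Two-level evaluation: each peer judge scores the output, then the
    meta judge receives all peer scores and returns the final judgment."""
    peer_scores = [judge(output) for judge in peer_judges]
    return meta_judge(output, peer_scores)

# Toy judges standing in for LLM/VLM calls (hypothetical scoring scale 1-5):
peers = [lambda o: 4, lambda o: 5, lambda o: 3]
meta = lambda o, scores: round(mean(scores))
print(hierarchical_evaluate("a generated answer", peers, meta))  # 4
```

In the actual framework the peers and meta judge would be pre-trained models prompted for scores and feedback; the hierarchy is what lets extra inference-time compute improve the final judgment.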
PCF Learned Sort: a Learning Augmented Sort Algorithm with O(nloglogn) Expected Complexity
Atsuki Sato · Yusuke Matsui
Sorting is one of the most fundamental algorithms in computer science. Recently, Learned Sorts, which use machine learning to improve sorting speed, have attracted attention. While existing studies show that Learned Sort is empirically faster than classical sorting algorithms, they do not provide theoretical guarantees about its computational complexity. We propose Piecewise Constant Function (PCF) Learned Sort, a theoretically guaranteed Learned Sort algorithm. We prove that the expected complexity of PCF Learned Sort is $\mathcal{O}(n \log \log n)$ under mild assumptions on the data distribution. We also confirm empirically that PCF Learned Sort has a computational complexity of $\mathcal{O}(n \log \log n)$ on both synthetic and real datasets. This is the first study to theoretically support the empirical success of Learned Sort, and provides evidence for why Learned Sort is fast.
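The CDF-bucketing idea behind Learned Sorts can be sketched as follows: fit a piecewise-constant CDF on a random sample, bucket keys by their predicted rank, and sort each (small, in expectation) bucket. The sample fraction, bin count, and per-bucket fallback below are illustrative choices, not the parameters analyzed in the paper:

```python
import random

def pcf_learned_sort(arr, sample_frac=0.1, n_bins=100):
    """Sort via a piecewise-constant CDF model fit on a random sample.
    Correct for any sample; the CDF quality only affects speed."""
    n = len(arr)
    if n <= 1:
        return list(arr)
    lo, hi = min(arr), max(arr)
    if lo == hi:
        return list(arr)
    # Fit the piecewise-constant CDF: a histogram of a sample, whose
    # normalized cumulative counts approximate P(X <= x) per bin.
    sample = random.sample(arr, max(1, int(sample_frac * n)))
    counts = [0] * n_bins
    for x in sample:
        counts[min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)] += 1
    cdf, acc = [], 0
    for c in counts:
        cdf.append(acc / len(sample))
        acc += c
    # Bucket each key by its predicted CDF value, then sort each bucket.
    buckets = [[] for _ in range(n_bins)]
    for x in arr:
        b = min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)
        buckets[min(int(cdf[b] * n_bins), n_bins - 1)].append(x)
    return [x for b in buckets for x in sorted(b)]

data = [random.random() for _ in range(1000)]
print(pcf_learned_sort(data) == sorted(data))  # True
```

The output is always exactly sorted regardless of how well the sample represents the distribution; a good CDF fit only makes the buckets smaller and the sort faster, which is where the expected-complexity analysis comes in.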
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Or Tal · Felix Kreuk · Yossi Adi
Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g. chord progressions. State-of-the-art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices.
This diversity complicates efforts to evaluate models fairly and identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm.
We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems.
Specifically, we compare the two arguably most common modeling paradigms: auto-regressive decoding and conditional flow-matching.
We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures.
Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting.
This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation.
Audio sampled examples are available at: https://huggingface.co/spaces/ortal1602/ARvsFM
Distributed Quasi-Newton Method for Fair and Fast Federated Learning
Shayan Mohajer Hamidi · Linfeng Ye
Federated learning (FL) is a promising technology that enables edge devices/clients to collaboratively and iteratively train a machine learning model under the coordination of a central server. The most common approach to FL is first-order methods, where clients send their local gradients to the server in each iteration. However, these methods often suffer from slow convergence rates. As a remedy, second-order methods, such as quasi-Newton, can be employed in FL to accelerate its convergence. Unfortunately, similarly to the first-order FL methods, the application of second-order methods in FL can lead to unfair models, achieving high average accuracy while performing poorly on certain clients' local datasets. To tackle this issue, in this paper we introduce a novel second-order FL framework, dubbed distributed quasi-Newton federated learning (DQN-Fed). This approach seeks to ensure fairness while leveraging the fast convergence properties of quasi-Newton methods in the FL context. Specifically, DQN-Fed helps the server update the global model in such a way that (i) all local loss functions decrease to promote fairness, and (ii) the rate of change in local loss functions aligns with that of the quasi-Newton method. We prove the convergence of DQN-Fed and demonstrate its \textit{linear-quadratic} convergence rate. Moreover, we validate the efficacy of DQN-Fed across a range of federated datasets, showing that it surpasses state-of-the-art fair FL methods in fairness, average accuracy and convergence speed. The code for this paper is publicly available at \url{https://anonymous.4open.science/r/DQN-Fed-FDD2}.
Training Dynamics of Learning 3D-Rotational Equivariance
Max W. Shen · Ewa Nowara · Michael Maser · Kyunghyun Cho
While data augmentation is widely used to train symmetry-agnostic models, it remains unclear how quickly and effectively they learn to respect symmetries. We investigate this by deriving a principled measure of equivariance error that, for convex losses, calculates the percent of total loss attributable to imperfections in learned symmetry. We focus our empirical investigation on 3D-rotation equivariance on high-dimensional molecular tasks (flow matching, force field prediction, denoising voxels) and find that models quickly reduce equivariance error to $\leq$2\% of held-out loss within 1k-10k training steps, a result robust to model and dataset size. This happens because learning 3D-rotational equivariance is an easier learning task, with a smoother and better-conditioned loss landscape, than the main prediction task. For 3D rotations, the loss penalty for non-equivariant models is small throughout training, so they may achieve lower test loss than equivariant models per GPU-hour unless the equivariant ``efficiency gap'' is narrowed. We also experimentally and theoretically investigate the relationships between relative equivariance error, learning gradients, and model parameters.
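One way to measure rotational equivariance error is to average $\|f(Rx) - Rf(x)\|^2$ over sampled rotations, as sketched below; the rotation-sampling scheme and the squared-norm form are illustrative assumptions, not the paper's principled convex-loss decomposition:

```python
import numpy as np

def rotation_equivariance_error(f, X, n_samples=16, seed=0):
    """Estimate E||f(R x) - R f(x)||^2 over random 3D rotations R, for a
    function f mapping (N, 3) point clouds to (N, 3) point clouds."""
    rng = np.random.default_rng(seed)
    err = 0.0
    for _ in range(n_samples):
        # Random orthogonal matrix via QR of a Gaussian matrix.
        Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
        if np.linalg.det(Q) < 0:
            Q[:, 0] = -Q[:, 0]  # flip a column to get a proper rotation
        err += np.mean((f(X @ Q.T) - f(X) @ Q.T) ** 2)
    return err / n_samples

# An exactly equivariant map (the identity) has zero error; a
# non-equivariant map (elementwise square) does not.
X = np.random.default_rng(1).normal(size=(10, 3))
print(rotation_equivariance_error(lambda x: x, X))     # 0.0
print(rotation_equivariance_error(lambda x: x**2, X))  # > 0
```

A symmetry-agnostic network trained with rotation augmentation would be expected to drive this quantity down during training, which is the behavior the abstract tracks.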
Statistical Guarantees for Approximate Stationary Points of Shallow Neural Networks
Mahsa Taheri · Fang Xie · Johannes Lederer
Since statistical guarantees for neural networks are usually restricted to global optima of intricate objective functions, it is unclear whether these theories explain the performances of actual outputs of neural network pipelines. The goal of this paper is, therefore, to bring statistical theory closer to practice. We develop statistical guarantees for shallow linear neural networks that coincide up to logarithmic factors with the global optima but apply to stationary points and the points nearby. These results support the common notion that neural networks do not necessarily need to be optimized globally from a mathematical perspective. We then extend our statistical guarantees to shallow ReLU neural networks, assuming the first layer weight matrices are nearly identical for the stationary network and the target. More generally, despite being limited to shallow neural networks for now, our theories make an important step forward in describing the practical properties of neural networks in mathematical terms.
Synergistic Benefits of Joint Molecule Generation and Property Prediction
Adam Izdebski · Jan Olszewski · Pankhil Gawade · Krzysztof Koras · Serra Korkmaz · Valentin Rauscher · Jakub M. Tomczak · Ewa Szczurek
Modeling the joint distribution of data samples and their properties allows one to construct a single model for both data generation and property prediction, with synergistic benefits reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends the generative and predictive functionalities, using an alternating attention mechanism and a joint pre-training scheme. We show that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction and representation learning. Finally, we demonstrate the benefits of joint learning in a drug design use case of discovering novel antimicrobial peptides.
Dynamics-inspired Structure Hallucination for Protein-protein Interaction Modeling
Fang Wu · Stan Z. Li
Protein-protein interaction (PPI) represents a central challenge in biology, and accurately predicting the consequences of mutations in this context is crucial for drug design and protein engineering. Deep learning (DL) has shown promise in forecasting the effects of such mutations but is hindered by two primary constraints. First, the structures of mutant proteins are often elusive to acquire. Second, PPI takes place dynamically, which is rarely integrated into the DL architecture design. To address these obstacles, we present a novel framework named Refine-PPI with two key enhancements. First, we introduce a structure refinement module trained by a mask mutation modeling (MMM) task on available wild-type structures, which is then transferred to hallucinate the inaccessible mutant structures. Second, we employ a new kind of geometric network, called the probability density cloud network (PDC-Net), to capture 3D dynamic variations and encode the atomic uncertainty associated with PPI. Comprehensive experiments on SKEMPI.v2 substantiate the superiority of Refine-PPI over all existing tools for predicting free energy change. These findings underscore the effectiveness of our hallucination strategy and the PDC module in addressing the absence of mutant protein structure and modeling geometric uncertainty.
LC-PLM: Long-context Protein Language Modeling Using Bidirectional Mamba with Shared Projection Layers
Yingheng Wang · Zichen Wang · Gil Sadeh · Luca Zancato · Alessandro Achille · George Karypis · Huzefa Rangwala
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design. Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths. Such protein LMs cannot extrapolate to longer proteins and protein complexes well. They also fail to account for the underlying biological mechanisms carried out by biomolecular interactions and dynamics, i.e., proteins often interact with other proteins, molecules, and pathways in complex biological systems. In this work, we propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built upon selective structured state-space models, to learn high-quality universal protein representations at the amino acid token level using masked language modeling. We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction (PPI) graphs for a second stage of training. LC-PLM demonstrates favorable neural scaling laws, better length extrapolation capability, and up to 30% and 16% improvements on protein downstream tasks compared to Transformer-based ESM-2 when trained with 100B and 1T tokens, respectively. LC-PLM-G further trained within the context of PPI graphs shows promising results on protein structure and function prediction tasks. Our study demonstrates the benefit of increasing the context size with computationally efficient LM architecture (e.g., structured state space models) in learning universal protein representations and incorporating molecular interaction contexts contained in biological graphs. The model is available at github.com/amazon-science/LC-PLM.
On the stability of gradient descent with second order dynamics for time-varying cost functions
Travis E Gibson · Sawal Acharya · Anjali Parashar · Joseph Emilio Gaudio · Anuradha Annaswamy
Gradient based optimization algorithms deployed in Machine Learning (ML) applications are often analyzed and compared by their convergence rates or regret bounds. While these rates and bounds convey valuable information they don’t always directly translate to stability guarantees. Stability and similar concepts, like robustness, will become ever more important as we move towards deploying models in real-time and safety critical systems. In this work we build upon the results in Gaudio et al. 2021 and Moreu & Annaswamy 2022 for gradient descent with second order dynamics when applied to explicitly time varying cost functions and provide more general stability guarantees. These more general results can aid in the design and certification of these optimization schemes so as to help ensure safe and reliable deployment for real-time learning applications. We also hope that the techniques provided here will stimulate and cross-fertilize the analysis that occurs on the same algorithms from the online learning and stochastic optimization communities.
Variational Pseudo Marginal Methods for Jet Reconstruction in Particle Physics
Hanming Yang · Antonio Khalil Moretti · Sebastian Macaluso · Philippe Chlenski · Christian A. Naesseth · Itsik Pe'er
Reconstructing jets, which provide vital insights into the properties and histories of subatomic particles produced in high-energy collisions, is a central problem in collider-physics data analysis. This intricate task deals with estimating the latent structure of a jet (a binary tree) and involves parameters such as particle energy, momentum, and type. While Bayesian methods offer a natural approach for handling uncertainty and leveraging prior knowledge, they face significant challenges due to the super-exponential growth of potential jet topologies as the number of observed particles increases. To address this, we introduce a Combinatorial Sequential Monte Carlo approach for inferring jet latent structures. As a second contribution, we leverage the resulting estimator to develop a variational inference algorithm for parameter learning. Building on this, we introduce a variational family using a pseudo-marginal framework for a fully Bayesian treatment of all variables, unifying the generative model with the inference process. We illustrate our method's effectiveness through experiments using data generated with a collider physics generative model, highlighting superior speed and accuracy across a range of tasks.
Directed Exploration in Reinforcement Learning from Linear Temporal Logic
Marco Bagatella · Andreas Krause · Georg Martius
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning, as it allows describing objectives beyond the expressivity of conventional discounted return formulations. Nonetheless, recent works have shown that LTL formulas can be translated into a variable rewarding and discounting scheme, whose optimization produces a policy maximizing a lower bound on the probability of formula satisfaction. However, the synthesized reward signal remains fundamentally sparse, making exploration challenging. We aim to overcome this limitation, which can prevent current algorithms from scaling beyond low-dimensional, short-horizon problems. We show how better exploration can be achieved by further leveraging the LTL specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process, thus enabling a form of high-level value estimation. By taking a Bayesian perspective over LDBA dynamics and proposing a suitable prior distribution, we show that the values estimated through this procedure can be treated as a shaping potential and mapped to informative intrinsic rewards. Empirically, we demonstrate applications of our method from tabular settings to high-dimensional continuous systems, which have so far represented a significant challenge for LTL-based reinforcement learning algorithms.
Multi-Bellman operator for convergence of Q-learning with linear function approximation
Diogo S. Carvalho · Pedro A. Santos · Francisco S. Melo
We investigate the convergence of $Q$-learning with linear function approximation and introduce the multi-Bellman operator, an extension of the traditional Bellman operator. By analyzing the properties of this operator, we identify conditions under which the projected multi-Bellman operator becomes a contraction, yielding stronger fixed-point guarantees compared to the original Bellman operator. Building on these insights, we propose the multi-$Q$-learning algorithm, which achieves convergence and approximates the optimal solution with arbitrary precision. This contrasts with traditional $Q$-learning, which lacks such convergence guarantees. Finally, we empirically validate our theoretical results.
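The intuition behind the multi-Bellman operator can be illustrated on a toy tabular MDP: $m$ applications of the Bellman optimality operator contract by $\gamma^m$ rather than $\gamma$, which is what makes the projected operator easier to turn into a contraction. This sketch is tabular, not the paper's linear-function-approximation setting:

```python
import numpy as np

def bellman_q(Q, P, R, gamma):
    """Bellman optimality operator on a tabular Q-function.
    P: (S, A, S) transition tensor, R: (S, A) rewards."""
    return R + gamma * P @ Q.max(axis=1)

rng = np.random.default_rng(0)
S, A, gamma, m = 4, 2, 0.9, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # make transitions row-stochastic
R = rng.random((S, A))

# Apply the operator m times to two arbitrary Q-functions: the sup-norm
# gap shrinks by at least gamma**m, the multi-Bellman contraction factor.
Q1, Q2 = rng.random((S, A)), rng.random((S, A))
gap0 = np.abs(Q1 - Q2).max()
for _ in range(m):
    Q1, Q2 = bellman_q(Q1, P, R, gamma), bellman_q(Q2, P, R, gamma)
print(np.abs(Q1 - Q2).max() <= gamma**m * gap0)  # True
```

With function approximation a projection step follows each application, and a $\gamma^m$ contraction gives the projected composite operator more room to remain a contraction, which is the regime the paper characterizes.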
Iterated Q-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning
Théo Vincent · Daniel Palenicek · Boris Belousov · Jan Peters · Carlo D'Eramo
The vast majority of Reinforcement Learning methods are largely impacted by the computational effort and data requirements needed to obtain effective estimates of action-value functions, which in turn determine the quality of the overall performance and the sample-efficiency of the learning procedure. Typically, action-value functions are estimated through an iterative scheme that alternates the application of an empirical approximation of the Bellman operator and a subsequent projection step onto a considered function space. It has been observed that this scheme can be potentially generalized to carry out multiple iterations of the Bellman operator at once, benefiting the underlying learning algorithm. However, until now, it has been challenging to effectively implement this idea, especially in high-dimensional problems. In this paper, we introduce iterated $Q$-Network (i-QN), a novel principled approach that enables multiple consecutive Bellman updates by learning a tailored sequence of action-value functions where each serves as the target for the next. We show that i-QN is theoretically grounded and that it can be seamlessly used in value-based and actor-critic methods. We empirically demonstrate the advantages of i-QN in Atari $2600$ games and MuJoCo continuous control problems.
Defending Against Unknown Corrupted Agents: Reinforcement Learning of Adversarially Robust Nash Equilibria
Andi Nika · Jonathan Nöther · Adish Singla · Goran Radanović
We consider a Multi-agent Reinforcement Learning (MARL) setting, in which an attacker can arbitrarily corrupt any subset of up to $k$ out of $n$ agents at deployment. Our goal is to design agents that are robust against such an attack, by accounting for the presence of corrupted agents at test time. To that end, we introduce a novel solution concept, the Adversarially Robust Nash Equilibrium (ARNEQ), and provide theoretical proof of its existence in general-sum Markov games. Furthermore, we introduce a proof-of-concept model-based approach to computing it and theoretically prove its convergence under standard assumptions. We also present a practical approach called Adversarially Robust Training (ART), an independent learning algorithm based on stochastic gradient descent ascent. Our experiments in both cooperative and mixed cooperative-competitive environments demonstrate ART's effectiveness and practical value in enhancing MARL resilience against adversarial behavior.
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs
Jialin Yang · Dongfu Jiang · Tony He · Sherman Siu · Yuxuan Zhang · Disen Liao · Zhuofeng Li · Huaye Zeng · Yiming Jia · Haozhe Wang · Benjamin Schneider · Chi Ruan · Wentao Ma · Zhiheng Lyu · Yifei Wang · Yi Lu · Quy Duc Do · Ziyan Jiang · Ping Nie · Wenhu Chen
As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce $\textbf{StructEval}$, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: $\textbf{1)}$ generation tasks, producing structured output from natural language prompts, and $\textbf{2)}$ conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 task types, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps—even state-of-the-art models like o1-mini achieve an average score of only $75.58$, with open-source alternatives lagging approximately $10$ points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.
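A format-adherence check for the non-renderable formats above can be as simple as attempting a strict parse. The helper below is a hypothetical minimal sketch using Python's standard parsers (`check_format_adherence` is an assumed name; StructEval's actual metrics also score structural correctness against a task specification).

```python
import json
import csv
import io

def check_format_adherence(text, fmt):
    """Return True iff `text` is well-formed in the given format.
    Minimal sketch in the spirit of StructEval's format-adherence
    metric for non-renderable outputs; not the benchmark's scorer."""
    try:
        if fmt == "json":
            json.loads(text)
        elif fmt == "csv":
            rows = list(csv.reader(io.StringIO(text)))
            # ragged rows (unequal column counts) count as malformed
            widths = {len(r) for r in rows if r}
            if len(widths) > 1:
                return False
        else:
            raise ValueError(f"unsupported format: {fmt}")
        return True
    except (json.JSONDecodeError, csv.Error):
        return False
```

Renderable formats (HTML, React, SVG) need an actual rendering step, which is where much of the benchmark's difficulty comes from.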
Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?
Olawale Elijah Salaudeen · Nicole Chiou · Shiny Weng · Sanmi Koyejo
Spurious correlations, unstable statistical shortcuts a model can exploit, are expected to degrade performance out-of-distribution (OOD). However, across many popular OOD generalization benchmarks, vanilla empirical risk minimization (ERM) often achieves the highest OOD accuracy. Moreover, gains in in-distribution accuracy generally improve OOD accuracy, a phenomenon termed accuracy on the line, which contradicts the expected harm of spurious correlations. We show that these observations are an artifact of misspecified OOD datasets that do not include shifts in spurious correlations that harm OOD generalization, the setting they are meant to evaluate. Consequently, current practice evaluates "robustness" without truly stressing the spurious signals we seek to eliminate; our work pinpoints when that happens and how to fix it. Contributions. (i) We derive necessary and sufficient conditions for a distribution shift to reveal a model's reliance on spurious features; when these conditions hold, "accuracy on the line" disappears. (ii) We audit leading OOD datasets and find that most still display accuracy on the line, suggesting they are misspecified for evaluating robustness to spurious correlations. (iii) We catalog the few well-specified datasets and summarize generalizable design principles, such as identifying datasets of natural interventions (e.g., a pandemic), to guide future well-specified benchmarks.
Leveraging a Simulator for Learning Causal Representations from Post-Treatment Covariates for CATE
Lokesh Nagalapatti · Pranava Singhal · Avishek Ghosh · Sunita Sarawagi
Treatment effect estimation involves assessing the impact of different treatments on individual outcomes. Current methods estimate Conditional Average Treatment Effect (CATE) using observational datasets where covariates are collected before treatment assignment and outcomes are observed afterward, under assumptions like positivity and unconfoundedness. In this paper, we address a scenario where both covariates and outcomes are gathered after treatment. We show that post-treatment covariates render CATE unidentifiable, and recovering CATE requires learning treatment-independent causal representations. Prior work shows that such representations can be learned through contrastive learning if counterfactual supervision is available in observational data. However, since counterfactuals are rare, other works have explored using simulators that offer synthetic counterfactual supervision. Our goal in this paper is to systematically analyze the role of simulators in estimating CATE. We analyze the CATE error of several baselines and highlight their limitations. We then establish a generalization bound that characterizes the CATE error from jointly training on real and simulated distributions, as a function of the real-simulator mismatch. Finally, we introduce SimPONet, a novel method whose loss function is inspired from our generalization bound. We further show how SimPONet adjusts the simulator’s influence on the learning objective based on the simulator’s relevance to the CATE task. We experiment with various DGPs, by systematically varying the real-simulator distribution gap to evaluate SimPONet’s efficacy against state-of-the-art CATE baselines.
Enhancing Vision-Language Model with Unmasked Token Alignment
Jihao Liu · Jinliang Zheng · Boxiao Liu · Yu Liu · Hongsheng Li
Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces $\textbf{U}$nmasked $\textbf{T}$oken $\textbf{A}$lignment ($\textbf{UTA}$), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient by avoiding using the extra $\mathrm{[MASK]}$ tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks.
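The token-alignment signal can be pictured as a per-position cosine objective between student and frozen-teacher tokens. This is a hypothetical numpy sketch of such a loss; `uta_alignment_loss` is an assumed name, and the paper's full training pipeline (ViT forward pass, optimizer, data) is not reproduced here.

```python
import numpy as np

def uta_alignment_loss(student_tokens, teacher_tokens):
    """Cosine-alignment objective between ViT (student) tokens and
    frozen CLIP vision-encoder (teacher) tokens at the same positions.
    Arrays of shape (num_tokens, dim). Returns mean (1 - cos sim).
    Hypothetical sketch of the training signal, not the paper's code."""
    s = student_tokens / np.linalg.norm(student_tokens, axis=-1, keepdims=True)
    t = teacher_tokens / np.linalg.norm(teacher_tokens, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))
```

Because all tokens are unmasked, the same inputs are seen at pre-training and fine-tuning time, which is the source of the training-finetuning consistency claimed above.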
AB-UPT: Scaling Neural CFD Surrogates for High-Fidelity Automotive Aerodynamics Simulations via Anchored-Branched Universal Physics Transformers
Benedikt Alkin · Maurits Bleeker · Richard Kurle · Tobias Kronlachner · Reinhard Sonnleitner · Matthias Dorfer · Johannes Brandstetter
Recent advances in neural surrogate modeling offer the potential for transformative innovations in applications such as automotive aerodynamics. Yet, industrial-scale problems often involve volumetric meshes with cell counts reaching 100 million, presenting major scalability challenges. Complex geometries further complicate modeling through intricate surface-volume interactions, while quantities such as vorticity are highly nonlinear and must satisfy strict divergence-free constraints. To address these requirements, we introduce AB-UPT as a novel modeling scheme for building neural surrogates for CFD simulations. AB-UPT is designed to: (i) decouple geometry encoding and prediction tasks via multi-branch operators; (ii) enable scalability to high-resolution outputs via neural simulation in a low-dimensional latent space, coupled with anchored neural field decoders to predict high-fidelity outputs; (iii) enforce physics consistency by a divergence-free formulation. We show that AB-UPT yields state-of-the-art predictive accuracy of surface and volume fields on automotive CFD simulations ranging from 33 thousand up to 150 million mesh cells. Furthermore, our anchored neural field architecture enables the enforcement of hard physical constraints on the physics predictions without degradation in performance, exemplified by modeling divergence-free vorticity fields. Notably, the proposed models can be trained on a single GPU in less than a day and predict industry-standard surface and volume fields within seconds. Additionally, we show that the flexible design of our method enables neural simulation from a CAD geometry alone, thereby eliminating the need for costly CFD meshing procedures for inference.
Online Selective Conformal Inference: Errors and Solutions
Yusuf Sale · Aaditya Ramdas
In online selective conformal inference, data arrives sequentially, and prediction intervals are constructed only when an online selection rule is met. Since online selections may break the exchangeability between the selected test datum and the rest of the data, one must correct for this by suitably selecting the calibration data. In this paper, we evaluate existing calibration selection strategies and pinpoint some fundamental errors in the associated claims that guarantee selection-conditional coverage and control of the false coverage rate (FCR). To address these shortcomings, we propose novel calibration selection strategies that provably preserve the exchangeability of the calibration data and the selected test datum. Consequently, we demonstrate that online selective conformal inference with these strategies guarantees both selection-conditional coverage and FCR control. Our theoretical findings are supported by experimental evidence examining trade-offs between valid methods.
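For background, the split-conformal construction that the selection strategies above must preserve uses the $\lceil(n+1)(1-\alpha)\rceil$-th smallest calibration residual as the interval half-width. The sketch below is a standard textbook version, not this paper's method; the paper's contribution is which calibration points may validly enter `cal_residuals` after an online selection.

```python
import math
import numpy as np

def split_conformal_interval(cal_residuals, alpha=0.1):
    """Half-width of a split conformal prediction interval: the
    ceil((n+1)(1-alpha))-th order statistic of calibration residuals.
    Valid when the test residual is exchangeable with the calibration
    ones -- exactly the property online selection can break."""
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:  # too few calibration points for this alpha
        return float("inf")
    return float(np.sort(np.asarray(cal_residuals))[k - 1])
```

With 99 calibration residuals and $\alpha = 0.1$, the rule picks the 90th order statistic.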
Simplex Constrained Sparse Optimization via Tail Screening
Peng Chen · Jin Zhu · Junxian Zhu · Xueqin Wang
We consider the probabilistic simplex-constrained sparse recovery problem. The commonly used Lasso-type penalty for promoting sparsity is ineffective in this context since it is constant within the simplex. Despite this challenge, the simplex constraint itself fortunately brings a self-regularization property, i.e., the empirical risk minimizer without any sparsity-promoting procedure attains the usual Lasso-type estimation error. Moreover, we analyze the iterates of a projected gradient descent method and show their convergence to the ground truth sparse solution at a geometric rate until the desired statistical precision is attained. Although the estimation error is statistically optimal, the resulting solution is usually denser than the sparse ground truth. To further sparsify the iterates, we propose a method called PERMITS that embeds a tail screening procedure, i.e., identifying negligible components and discarding them during iterations, into the projected gradient descent method. Furthermore, we combine tail screening with a special information criterion to balance the trade-off between fitness and complexity. Theoretically, the proposed PERMITS method can exactly recover the ground truth support set under mild conditions and thus obtains the oracle property. We demonstrate the statistical and computational efficiency of PERMITS with both synthetic and real data. The implementation of the proposed method can be found at https://github.com/abess-team/PERMITS.
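The two ingredients above, projection onto the simplex and tail screening of negligible components, can be sketched as follows. This is a hypothetical illustration: `project_simplex` is the standard sort-based Euclidean projection, while `pgd_with_tail_screening`, its fixed threshold `tau`, and the renormalization step are illustrative assumptions, not PERMITS's actual screening criterion.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex
    (standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def pgd_with_tail_screening(grad, x0, step=0.1, tau=1e-3, iters=200):
    """Projected gradient descent on the simplex with a tail-screening
    step that zeroes components below tau and renormalizes.
    Hypothetical sketch of the PERMITS idea; the threshold and schedule
    are illustrative, not the paper's criterion."""
    x = x0.copy()
    for _ in range(iters):
        x = project_simplex(x - step * grad(x))
        x[x < tau] = 0.0   # screen negligible tail components
        x /= x.sum()       # stay on the simplex
    return x
```

Without the screening step, the iterates converge but typically keep many tiny nonzero components; screening removes them so the support can be recovered exactly.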
No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur · Darshan Singh S · Makarand Tapaswi
Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe.
Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g. +4.8% - 7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters. We also outperform vanilla SR by +14.4% to +19.5%.
Model Tensor Planning
An Thai Le · Khai Nguyen · Minh Nhat VU · Joao Carvalho · Jan Peters
Sampling-based model predictive control (MPC) offers strong performance in nonlinear and contact-rich robotic tasks, yet often suffers from poor exploration due to locally greedy sampling schemes. We propose \emph{Model Tensor Planning} (MTP), a novel sampling-based MPC framework that introduces high-entropy control trajectory generation through structured tensor sampling. By sampling over randomized multipartite graphs and interpolating control trajectories with B-splines and Akima splines, MTP ensures smooth and globally diverse control candidates. We further propose a simple $\beta$-mixing strategy that blends local exploitative and global exploratory samples within the modified Cross-Entropy Method (CEM) update, balancing control refinement and exploration. Theoretically, we show that MTP achieves asymptotic path coverage and maximum entropy in the control trajectory space in the limit of infinite tensor depth and width.
Our implementation is fully vectorized using JAX and compatible with MuJoCo XLA, supporting \emph{Just-in-time} (JIT) compilation and batched rollouts for real-time control with online domain randomization. Through experiments on various challenging robotic tasks, ranging from dexterous in-hand manipulation to humanoid locomotion, we demonstrate that MTP outperforms standard MPC and evolutionary strategy baselines in task success and control robustness. Design and sensitivity ablations confirm the effectiveness of MTP’s tensor sampling structure, spline interpolation choices, and mixing strategy. Altogether, MTP offers a scalable framework for robust exploration in model-based planning and control.
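The $\beta$-mixing step inside the modified CEM update can be sketched independently of the tensor-sampling machinery. Below is a hypothetical numpy version: a $(1-\beta)$ fraction of candidates comes from the local Gaussian and a $\beta$ fraction from a caller-supplied global sampler, standing in for MTP's spline-interpolated tensor samples; the function name and elite fraction are illustrative assumptions.

```python
import numpy as np

def beta_mixed_cem_update(mean, std, global_sampler, cost, n=64,
                          beta=0.3, elite_frac=0.1, rng=None):
    """One modified CEM update with beta-mixing of local exploitative
    samples (Gaussian around the current mean) and global exploratory
    samples from `global_sampler`. Returns the updated (mean, std).
    Hypothetical sketch -- MTP's global samples come from structured
    tensor sampling with B-spline/Akima interpolation."""
    rng = rng or np.random.default_rng(0)
    n_global = int(beta * n)
    local = rng.normal(mean, std, size=(n - n_global, mean.size))
    glob = global_sampler(n_global)
    cands = np.vstack([local, glob])
    costs = np.array([cost(c) for c in cands])
    elites = cands[np.argsort(costs)[: max(1, int(elite_frac * n))]]
    return elites.mean(axis=0), elites.std(axis=0) + 1e-6
```

Setting $\beta = 0$ recovers plain CEM; larger $\beta$ trades local refinement for the global coverage the paper's theory addresses.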
PCNN: Probable-Class Nearest-Neighbor Explanations Improve Fine-Grained Image Classification Accuracy for AIs and Humans
Giang Nguyen · Valerie Chen · Mohammad Reza Taesiri · Anh Totti Nguyen
Nearest neighbors (NN) are traditionally used to compute final decisions, e.g., in Support Vector Machines or k-NN classifiers, and to provide users with explanations for the model's decision.
In this paper, we show a novel utility of nearest neighbors: To improve predictions of a frozen, pretrained image classifier C.
We leverage an image comparator S that (1) compares the input image with NN images from the top-K most probable classes given by C; and (2) uses scores from S to weight the confidence scores of C to refine predictions.
Our method consistently improves fine-grained image classification accuracy on CUB-200, Cars-196, and Dogs-120.
Also, a human study finds that showing users our probable-class nearest neighbors (PCNN) reduces over-reliance on AI, improving their decision accuracy over prior work, which shows only the most-probable (top-1) class examples.
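The refinement step described above, reweighting the frozen classifier's top-K probabilities with comparator scores, can be sketched directly. This is a hypothetical product reweighting with an assumed function name; the paper's comparator S is a learned network, and its exact combination rule may differ.

```python
import numpy as np

def pcnn_refine(probs, comparator_scores, top_k=5):
    """Refine a frozen classifier's prediction: multiply its top-K
    class probabilities by comparator scores S(input, NN_class) in
    [0, 1] and renormalize. Hypothetical sketch of the PCNN scheme."""
    top = np.argsort(probs)[::-1][:top_k]
    refined = probs.copy()
    refined[top] = probs[top] * comparator_scores[top]
    return refined / refined.sum()
```

A confidently wrong top-1 class gets demoted when its nearest-neighbor examples look unlike the input, which is the mechanism behind the accuracy gains.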
Learning Deformable Body Interactions With Adaptive Spatial Tokenization
Hao Wang · Yu Liu · Daniel Biggs · Haoru Wang · Jiandong Yu · Ping Huang
Simulating interactions between deformable bodies is vital in fields like material science, mechanical design, and robotics. While learning-based methods with Graph Neural Networks (GNNs) are effective at solving complex physical systems, they encounter scalability issues when modeling deformable body interactions. To model interactions between objects, pairwise global edges have to be created dynamically, which is computationally intensive and impractical for large-scale meshes. To overcome these challenges, drawing on insights from geometric representations, we propose an Adaptive Spatial Tokenization (AST) method for efficient representation of physical states. By dividing the simulation space into a grid of cells and mapping unstructured meshes onto this structured grid, our approach naturally groups adjacent mesh nodes. We then apply a cross-attention module to map the sparse cells into a compact, fixed-length embedding, serving as tokens for the entire physical state. Self-attention modules are employed to predict the next state over these tokens in latent space. This framework leverages the efficiency of tokenization and the expressive power of attention mechanisms to achieve accurate and scalable simulation results. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches in modeling deformable body interactions. Notably, it remains effective on large-scale simulations with meshes exceeding 100,000 nodes, where existing methods are hindered by computational limitations. Additionally, we contribute a novel large-scale dataset encompassing a wide range of deformable body interactions to support future research in this area.
Discrete Audio Tokens: More Than a Survey!
Pooneh Mousavi · Gallil Maimon · Adel Moumen · Darius Petermann · Jiatong Shi · Haibin Wu · Haici Yang · Anastasia Kuznetsova · Artem Ploujnikov · Ricard Marxer · Bhuvana Ramabhadran · Benjamin Elizalde · Loren Lugosch · Jinyu Li · Cem Subakan · Phil Woodland · Minje Kim · Hung-yi Lee · Shinji Watanabe · Yossi Adi · Mirco Ravanelli
Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
Information Theoretic Guarantees For Policy Alignment In Large Language Models
Youssef Mroueh · Apoorva Nitsure
Policy alignment of large language models refers to constrained policy optimization, where the policy is optimized to maximize a reward while staying close to a reference policy based on an $f$-divergence like $\mathsf{KL}$ divergence. The best of $n$ alignment policy selects the sample with the highest reward from $n$ independent samples. Recent work shows that the reward improvement of the aligned policy scales as $\sqrt{\mathsf{KL}}$, with an explicit bound on the $\mathsf{KL}$ for best of $n$ policies. We show that this $\sqrt{\mathsf{KL}}$ bound holds if the reference policy’s reward has sub-gaussian tails. For best of $n$ policies, the $\mathsf{KL}$ bound applies to any $f$-divergence through a reduction to exponential order statistics using the Rényi representation. Tighter control can be achieved with Rényi divergence if additional tail information is known. Finally, we demonstrate how these bounds transfer to golden rewards, resulting in decreased golden reward improvement due to proxy reward overestimation and approximation errors.
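The best-of-$n$ analysis above builds on the standard closed-form bound $\mathsf{KL}(\pi_{\text{best-of-}n} \,\|\, \pi_{\text{ref}}) \le \log n - (n-1)/n$, which the paper refines and extends to other $f$-divergences. A minimal sketch of that bound:

```python
import math

def best_of_n_kl_bound(n):
    """Standard upper bound on KL(best-of-n policy || reference):
    log(n) - (n - 1)/n. The reward improvement then scales as
    sqrt(KL) under sub-gaussian reward tails, per the discussion above."""
    return math.log(n) - (n - 1) / n
```

The bound is 0 at $n = 1$, grows like $\log n$, and stays strictly below $\log n$ for $n > 1$, so increasing $n$ buys reward at a slowly growing divergence cost.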