Track: Poster Session 6 Pavilion 4

Poster

P4-#3001

Evolutionary Caching to Accelerate Your Off-the-Shelf Diffusion Model

Anirud Aggarwal ⋅ Abhinav Shrivastava ⋅ Matthew Gwilliam

Diffusion-based image generation models excel at producing high-quality synthetic content, but suffer from slow and computationally expensive inference. Prior work has attempted to mitigate this by caching and reusing features within diffusion transformers across inference steps. These methods, however, often rely on rigid heuristics that result in limited acceleration or poor generalization across architectures. We propose Evolutionary Caching to Accelerate Diffusion models (ECAD), a genetic algorithm that learns efficient, per-model, caching schedules forming a Pareto frontier, using only a small set of calibration prompts. ECAD requires no modifications to network parameters or reference images. It offers significant inference speedups, enables fine-grained control over the quality-latency trade-off, and adapts seamlessly to different diffusion models. Notably, ECAD's learned schedules can generalize effectively to resolutions and model variants not seen during calibration. We evaluate ECAD on PixArt-alpha, PixArt-Sigma, and FLUX.1-dev using multiple metrics (FID, CLIP, Image Reward) across diverse benchmarks (COCO, MJHQ-30k, PartiPrompts), demonstrating consistent improvements over previous approaches. On PixArt-alpha, ECAD identifies a schedule that outperforms the previous state-of-the-art method by 4.47 COCO FID while increasing inference speedup from 2.35x to 2.58x. Our results establish ECAD as a scalable and generalizable approach for accelerating diffusion inference. Our project page and code are available here: https://research.aniaggarwal.com/ecad

Poster

P4-#3002

Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps

Shan Wang ⋅ Peixia Li ⋅ Chenchen Xu ⋅ Ziang Cheng ⋅ Jiayu Yang ⋅ Hongdong Li ⋅ Pulak Purkait

We propose Light–Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. Unlike ray tracing, which requires full 3D reconstruction, LGI captures essential light–shadow interactions reliably and accurately, computed from off-the-shelf 2.5D depth map predictions. LGI explicitly ties illumination direction to geometry, providing a physics-inspired prior that constrains generative models. Without such prior, these models often produce floating shadows, inconsistent illumination, and implausible shadow geometry. Building on this representation, we propose a unified pipeline for joint shadow generation and relighting-unlike prior methods that treat them as disjoint tasks-capturing the intrinsic coupling of illumination and shadowing essential for modeling indirect effects. By embedding LGI into a bridge-matching generative backbone, we reduce ambiguity and enforce physically consistent light–shadow reasoning. To enable effective training, we curated the first large-scale benchmark dataset for joint shadow and relighting, covering reflections, transparency, and complex interreflections. Experiments show significant gains in realism and consistency across synthetic and real images. LGI thus bridges geometry-inspired rendering with generative modeling, enabling efficient, physically consistent shadow generation and relighting.

Poster

P4-#3003

ReactID: Synchronizing Realistic Actions and Identity in Personalized Video Generation

Wei Li ⋅ Yiheng Zhang ⋅ Fuchen Long ⋅ Zhaofan Qiu ⋅ Ting Yao ⋅ Xiaoyan Sun ⋅ Tao Mei

Personalized video generation faces a fundamental trade-off between identity consistency and action realism: overly rigid identity preservation often leads to unnatural motion, while emphasis on action dynamics can compromise subject fidelity. This tension stems from three interrelated challenges: imprecise subject-video alignment, unstable training due to varying sample difficulties, and inadequate modeling of fine-grained actions. To address this, we propose ReactID, a comprehensive framework that harmonizes identity accuracy and motion naturalness through coordinated advances in data, training, and action modeling. First, we construct ReactID-Data, a large-scale dataset annotated with a high-precision pipeline combining vision-based entity label extraction, MLLM-based subject detection, and post-verification to ensure reliable subject-video correspondence. Second, we analyze learning difficulty along dimensions such as subject size, appearance similarity, and sampling strategy, and devise a progressive training curriculum that evolves from easy to hard samples, ensuring stable convergence while avoiding identity overfitting and copy-paste artifacts. Third, ReactID introduces a novel timeline-based conditioning mechanism that supplements monolithic text prompts with structured multi-action sequences. Each sub-action is annotated with precise timestamps and descriptions, and integrated into the diffusion model via two novel components: subject-aware cross-attention module to bind sub-action to the specific subject of interest and temporally-adaptive RoPE to embed the rescaled temporal coordinates invariant to action duration. Experiments show that ReactID achieves state-of-the-art performance in both identity preservation and action realism, effectively balancing the two objectives.

Poster

P4-#3004

Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective

Hangjie Yuan ⋅ Weihua Chen ⋅ Jun CEN ⋅ Hu Yu ⋅ Jingyun Liang ⋅ Shuning Chang ⋅ Zhihui Lin ⋅ Tao Feng ⋅ Pengwei Liu ⋅ Jiazheng Xing ⋅ Hao Luo ⋅ Jiasheng Tang ⋅ Fan Wang ⋅ Yi Yang

Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive (AR) video generation. Existing AR video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an LLM-based unified model for AR video generation with efficient discrete diffusion. Firstly, to fit videos with LLMs, we identify that 1D RoPE is ill-suited for visual spatiotemporal correlation modeling, and while demonstrated to be useful, naive 3D RoPE exhibits imbalanced frequency spectra. Therefore, we propose MM‑RoPE, which preserves the original textual RoPE while seamlessly accommodating video data with comprehensive frequency spectra and scaled 3D positions. Secondly, to fit the video data's nature and overcome the inefficiency of next-token decoding, we adopt a parallel and mask-based discrete diffusion with the intra-frame bidirectional and inter-frame causal attention masks. Based on this attention mask, we uncover the frame‑wise loss imbalance issue caused by spatial information redundancy and propose Autoregressive Discrete Diffusion Forcing, which introduces temporal tube masking during training with a compatible inference‑time masking policy to avoid quality degradation. Despite using only 48 GPUs for pre-training, limited data and a discrete tokenizer, Lumos-1 achieves results surpassing those of Show-o2 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V.

Poster

P4-#3005

MotionStream: Real-Time Video Generation with Interactive Motion Controls

Joonghyuk Shin ⋅ Zhengqi Li ⋅ Richard Zhang ⋅ Jun-Yan Zhu ⋅ Jaesik Park ⋅ Eli Shechtman ⋅ Xun Huang

Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons -- (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by preventing error accumulation, and (3) maintaining fast inference, without incurring growth in computational cost due to increasing context windows. A key to our approach is introducing carefully designed sliding-window causal attention, combined with attention sinks. By incorporating self-rollout with attention sinks and KV cache rolling during training, we properly simulate inference-time extrapolations with a fixed context window, enabling constant-speed generation of arbitrarily long videos. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming. With MotionStream, users can paint trajectories, control cameras, or transfer motion, and see results unfold in real-time, delivering a truly interactive experience.

Poster

P4-#3006

Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization

Tong Shao ⋅ Yusen Fu ⋅ Guoying Sun ⋅ Jingde Kong ⋅ Zhuotao Tian ⋅ Jingyong Su

Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of model to acceleration jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves the caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-$\alpha$, StableDiffusion1.5 and Hunyuan. Our code is released publicly at https://github.com/leaves162/CEM.

Poster

P4-#3007

DiCache: Let Diffusion Model Determine Its Own Cache

Jiazi Bu ⋅ Pengyang Ling ⋅ Yujie Zhou ⋅ Yibin Wang ⋅ Yuhang Zang ⋅ Dahua Lin ⋅ Jiaqi Wang

Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine caching timings and adopting handcrafted rules for multi-step cache utilization. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail to cope with diverse samples. In this paper, a strong sample-specific correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of deep-layer features. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain an on-the-fly indicator for the caching error in real time, enabling the model to dynamically customize the caching schedule for each sample. (2) Dynamic Cache Trajectory Alignment adaptively approximates the deep-layer feature output from multi-step historical caches based on the shallow-layer feature trajectory, facilitating higher visual quality. Extensive experiments validate DiCache’s capability in achieving higher efficiency and improved fidelity over state-of-the-art approaches on various leading diffusion models including WAN 2.1, HunyuanVideo and Flux.

Poster

P4-#3008

Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization

Yifan Chang ⋅ Jie Qin ⋅ Limeng Qiao ⋅ Xiaofeng Wang ⋅ Zheng Zhu ⋅ Lin Ma ⋅ Xingang Wang

Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress–process–recover pipeline, enabling stable and effective codebook training. By combining VQBridge with learning annealing, our VQN achieves full (100\%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ). Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100\% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.

Poster

P4-#3009

Anchor Frame Bridging for Coherent First-Last Frame Video Generation

Xuehan Hou ⋅ Meng Fan ⋅ Pengchong Qiao ⋅ Rat Cheng ⋅ Yian Zhao ⋅ Lei Zhu ⋅ Kaiwen Cheng ⋅ Chang Liu ⋅ Jie Chen

First-last frame video generation has recently gained significant attention. It enables coherent motion generation between specified first and last frames. However, this approach suffers from semantic degradation in intermediate frames, causing scene distortion and subject deformation that undermine temporal consistency. To address this issue, we introduce Anchor Frame Bridging (AFB), a novel plug-and-play method that explicitly bridges semantic continuity from boundary frames to intermediate frames, offering training-free adaptability and generalizability. By adaptively interpolating anchor frames at temporally critical locations exhibiting maximal semantic discontinuities, our approach effectively mitigates semantic drift in intermediate frames. Specifically, we propose an adaptive anchor frame selection module, which generates text-aligned candidate frames via frame order reversal and selects anchors based on semantic continuity. Subsequently, we develop anchor frame guided generation, which leverages the selected anchor frames to guide semantic propagation across intermediate frames, ensuring consistent boundary semantics and preserving temporal coherence throughout the video sequence. The final video is synthesized using the first frame, last frame, selected anchor frames, and the text prompt. The results demonstrate that our method significantly enhances the temporal consistency and overall quality of generated videos. Specifically, when applied to the Wan2.1-I2V model, it yields improvements of 16.58\% in FVD and 10.21\% in PSNR. The codes are provided in the supplementary material.

Poster

P4-#3010

Group Critical-token Policy Optimization for Autoregressive Image Generation

Guohui Zhang ⋅ Hu Yu ⋅ Xiaoxiao Ma ⋅ JingHao Zhang ⋅ Yaning Pan ⋅ Mingde Yao ⋅ Jie Xiao ⋅ Linjiang Huang ⋅ Jie Huang ⋅ Feng Zhao

Recent studies have extended Reinforcement Learning with Verifiable Rewards (RLVR) to autoregressive (AR) visual generation and achieved promising progress. However, existing methods typically apply uniform optimization across all image tokens, while the varying contributions of different image tokens for RLVR's training remain unexplored. In fact, the key obstacle lies in how to identify more critical image tokens during AR generation and implement effective token-wise optimization for them. To tackle this challenge, we propose $\textbf{G}$roup $\textbf{C}$ritical-token $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{GCPO}$), which facilitates effective policy optimization on critical tokens. We identify the critical tokens in RLVR-based AR generation from three perspectives, specifically: $\textbf{(1)}$ Causal dependency: early tokens fundamentally determine the later tokens and final image effect due to unidirectional dependency; $\textbf{(2)}$ Entropy-induced spatial structure: tokens with high entropy gradients correspond to image structure and bridges distinct visual regions; $\textbf{(3)}$ RLVR-focused token diversity: tokens with low visual similarity across a group of sampled images contribute to richer token-level diversity. For these identified critical tokens, we further introduce a dynamic token-wise advantage weight to encourage exploration, based on confidence divergence between the policy model and reference model. By leveraging 30\% of the image tokens, GCPO achieves better performance than GRPO with full tokens. Extensive experiments on multiple text-to-image benchmarks for both AR models and unified multimodal models demonstrate the effectiveness of GCPO for AR visual generation.

Poster

P4-#3013

HiGS: History-Guided Sampling for Plug-and-Play Enhancement of Diffusion Models

Seyedmorteza Sadat ⋅ Farnood Salehi ⋅ Romann Weber

While diffusion models have made remarkable progress in image generation, their outputs can still appear unrealistic and lack fine details, especially when using fewer number of neural function evaluations (NFEs) or lower guidance scales. To address this issue, we propose a novel momentum-based sampling technique, termed history-guided sampling (HiGS), which enhances quality and efficiency of diffusion sampling by integrating recent model predictions into each inference step. Specifically, HiGS leverages the difference between the current prediction and a weighted average of past predictions to steer the sampling process toward more realistic outputs with better details and structure. Our approach introduces practically no additional computation and integrates seamlessly into existing diffusion frameworks, requiring neither extra training nor fine-tuning. Extensive experiments show that HiGS consistently improves image quality across diverse models and architectures and under varying sampling budgets and guidance scales. Moreover, using a pretrained SiT model, HiGS achieves a new state-of-the-art FID of 1.61 for unguided ImageNet generation at 256$\times$256 with only 30 sampling steps (instead of the standard 250). We thus present HiGS as a plug-and-play enhancement to standard diffusion sampling that enables faster generation with higher fidelity.

Poster

P4-#3014

EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning

Xuan Ju ⋅ Tianyu Wang ⋅ Yuqian Zhou ⋅ HE Zhang ⋅ Qing Liu ⋅ Cherry Zhao ⋅ Zhifei Zhang ⋅ Yijun Li ⋅ Yuanhao Cai ⋅ Shaoteng Liu ⋅ Daniil Pakhomov ⋅ Zhe Lin ⋅ Soo Ye Kim ⋅ Qiang Xu

Recent advances in foundation models highlight a clear trend toward unification and scaling, showing emergent capabilities across diverse domains. While image generation and editing have rapidly transitioned from task-specific to unified frameworks, video generation and editing remain fragmented due to architectural limitations and data scarcity. In this work, we introduce EditVerse, a unified framework for image and video generation and editing within a single model. By representing all modalities, i.e., text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning, natural cross-modal knowledge transfer, and flexible handling of inputs and outputs with arbitrary resolutions and durations. To address the lack of video editing training data, we design a scalable data pipeline that curates 232K video editing samples and combines them with large-scale image and video datasets for joint training. Furthermore, we present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions. Extensive experiments and user studies demonstrate that EditVerse achieves state-of-the-art performance, surpassing existing open-source and commercial models, while exhibiting emergent editing and generation abilities across modalities.

Poster

P4-#3015

Continuous Space-Time Video Super-Resolution with 3D Fourier Fields

Alexander Becker ⋅ Julius Erbach ⋅ Dominik Narnhofer ⋅ Konrad Schindler

We introduce a novel formulation for continuous space-time video super-resolution. Instead of decoupling the representation of a video sequence into separate spatial and temporal components and relying on brittle, explicit frame warping for motion compensation, we encode video as a continuous, spatio-temporally coherent 3D Video Fourier Field (VFF). That representation offers three key advantages: (1) it enables cheap, flexible sampling at arbitrary locations in space and time; (2) it is able to simultaneously capture fine spatial detail and smooth temporal dynamics; and (3) it offers the possibility to include an analytical, Gaussian point spread function in the sampling to ensure aliasing-free reconstruction at arbitrary scale. The coefficients of the proposed, Fourier-like sinusoidal basis are predicted with a neural encoder with a large spatio-temporal receptive field, conditioned on the low-resolution input video. Through extensive experiments, we show that our joint modeling substantially improves both spatial and temporal super-resolution and sets a new state of the art for multiple benchmarks: across a wide range of upscaling factors, it delivers sharper and temporally more consistent reconstructions than existing baselines, while being computationally more efficient. Project page: https://v3vsr.github.io.

Poster

P4-#3016

SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models

Ouxiang Li ⋅ Yuan Wang ⋅ Xinting Hu ⋅ Houcheng Jiang ⋅ Jack tao ⋅ Yanbin Hao ⋅ James Ma ⋅ Fuli Feng

Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, privacy violations, and offensive content. In scalable erasure applications, fine-tuning-based methods are time-consuming to precisely erase multiple target concepts, while real-time editing-based methods often degrade the generation quality of non-target concepts due to conflicting optimization objectives. To address this dilemma, we introduce SPEED, a scalable, precise, and efficient concept erasure approach that directly edits model parameters. SPEED searches for a null space, a model editing space where parameter updates do not affect non-target concepts, to achieve scalable and precise erasure. To facilitate accurate null space optimization, we incorporate three complementary strategies: Influence-based Prior Filtering (IPF) to selectively retain the most affected non-target concepts, Directed Prior Augmentation (DPA) to enrich the filtered retain set with semantically consistent variations, and Invariant Equality Constraints (IEC) to preserve key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in non-target preservation while achieving efficient and high-fidelity concept erasure, successfully erasing 100 concepts within only 5 seconds. Our code and models are available at: https://github.com/Ouxiang-Li/SPEED.

Poster

P4-#3017

SPRINT: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

Dogyun Park ⋅ Moayed Haji-Ali ⋅ Yanyu Li ⋅ Willi Menapace ⋅ Sergey Tulyakov ⋅ Hyunwoo Kim ⋅ Aliaksandr Siarohin ⋅ Anil Kag

Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT (Sparse--Dense Residual Fusion for Efficient Diffusion Transformers), a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train--inference gap. On ImageNet-1K 256^2, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.

Poster

P4-#3018

QVGen: Pushing the Limit of Quantized Video Generative Models

Yushi Huang ⋅ Ruihao Gong ⋅ Jing Liu ⋅ Yifu Ding ⋅ Chengtao Lv ⋅ Haotong Qin ⋅ Jun Zhang

Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has proven notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present *QVGen*, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (*e.g.*, $4$-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a *rank-decay* strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out additional inference overhead. Extensive experiments across $4$ state-of-the-art (SOTA) video DMs, with parameter sizes ranging from $1.3\text{B}\sim14\text{B}$, show that QVGen is *the first* to reach full-precision comparable quality under $4$-bit settings. Moreover, it significantly outperforms existing methods. For instance, our $3$-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench. Code and models are available at https://github.com/ModelTC/QVGen.

Poster

P4-#3118

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

Wuyang Li ⋅ Wentao Pan ⋅ Po-Chien Luan ⋅ Yang Gao ⋅ Alexandre Alahi

We propose Stable Video Infinity (SVI) that can generate non-looping, ultra-long videos with stable visual quality, while supporting per-clip prompt control and multi-modal conditioning. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)’s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.

Poster

P4-#3117

Arbitrary-Shaped Image Generation via Spherical Neural Field Diffusion

Jiyuan Xia ⋅ Yuanshen Guan ⋅ Ruikang Xu ⋅ Zhiwei Xiong

Existing diffusion models excel at generating diverse content, but remain confined to fixed image shapes and lack the ability to flexibly control spatial attributes such as viewpoint, field-of-view (FOV), and resolution. To fill this gap, we propose Arbitrary-Shaped Image Generation (ASIG), the first generative framework that enables precise spatial attribute control while supporting high-quality synthesis across diverse image shapes (e.g., perspective, panoramic, and fisheye). ASIG introduces two key innovations: (1) a mesh-based spherical latent diffusion to generate a complete scene representation, with seam enforcement denoising strategy to maintain semantic and spatial consistency across viewpoints; and (2) a spherical neural field to sample arbitrary regions from the scene representation with coordinate conditions, enabling distortion-free generation at flexible resolutions. To this end, ASIG enables precise control over spatial attributes within a unified framework, enabling high-quality generation across diverse image shapes. Experiments demonstrate clear improvements over prior methods specifically designed for individual shapes. Code is available at https://github.com/xjyjjy/ASIG.

Poster

P4-#3116

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Haoyu Wu ⋅ Diankun Wu ⋅ Tianyu He ⋅ Junliang Guo ⋅ Yang Ye ⋅ Yueqi Duan ⋅ Jiang Bian

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometric-aware structure in their learned representations. To bridge this gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize latent 3D representations. Our key insight is to guide the model’s intermediate representations toward geometry-aware structure by aligning them with features from a pretrained geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing unnormalized geometric features from normalized diffusion representation. We evaluate Geometry Forcing on both camera view–conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods.

Poster

P4-#3115

SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Youliang Zhang ⋅ Zhaoyang Li ⋅ Duomin Wang ⋅ jiahe zhang ⋅ Deyu Zhou ⋅ Zixin Yin ⋅ Xili Dai ⋅ Gang Yu ⋅ Xiu Li

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual human. To facilitate research in this emerging area, we present SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over $8,743$ hours, SpeakerVid-5M contains more than $5.2$ million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark (VidChatBench) for future work. Both the dataset and the corresponding data processing code will be publicly released.

Poster

P4-#3114

Trajectory-aware Shifted State Space Models for Online Video Super-Resolution

Qiang Zhu ⋅ Xiandong MENG ⋅ Yuxuan Jiang ⋅ Fan Zhang ⋅ David Bull ⋅ Shuyuan Zhu ⋅ Bing Zeng ⋅ Ronggang Wang

Online video super-resolution (VSR) is an important technique for many real-world video processing applications, which aims to restore the current high-resolution video frame based on temporally previous frames. Most of the existing online VSR methods solely employ one neighboring previous frame to achieve temporal alignment, which limits long-range temporal modeling of videos. Recently, state space models (SSMs) have been proposed with linear computational complexity and a global receptive field, which significantly improve computational efficiency and performance. In this context, this paper presents a novel online VSR method based on Trajectory-aware Shifted SSMs (TS-Mamba), leveraging both long-term trajectory modeling and low-complexity Mamba to achieve efficient spatio-temporal information aggregation. Specifically, TS-Mamba first constructs the trajectories within a video to select the most similar tokens from the previous frames. Then, a Trajectory-aware Shifted Mamba Aggregation (TSMA) module consisting of proposed shifted SSMs blocks is employed to aggregate the selected tokens. The shifted SSMs blocks are designed based on Hilbert scannings and corresponding shift operations to compensate for scanning losses and strengthen the spatial continuity of Mamba. Additionally, we propose a trajectory-aware loss function to supervise the trajectory generation, ensuring the accuracy of token selection when training our model. Extensive experiments on three widely used VSR test datasets demonstrate that compared with six online VSR benchmark models, our TS-Mamba achieves state-of-the-art performance in most cases and over 22.7% complexity reduction (in MACs). The source code for TS-Mamba is available at https://github.com/QZ1-boy/TS-Mamba.

Poster

P4-#3113

TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

Hongyu Zhang ⋅ Yufan Deng ⋅ Zilin Pan ⋅ Peng-Tao Jiang ⋅ Bo Li ⋅ Qibin Hou ⋅ Zhen Dong ⋅ Zhiyang Dou ⋅ Zhou Daquan

Generating high-quality videos from complex temporal descriptions, which refer to prompts containing multiple sequential actions, remains a significant challenge. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt following capability. We attribute this problem to two primary causes: temporal misalignment between video content and the prompt, and conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and video demos are available in the supplementary materials.

Poster

P4-#3112

LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence

Zixin Yin ⋅ Xili Dai ⋅ Duomin Wang ⋅ Xianfang Zeng ⋅ Lionel Ni ⋅ Gang Yu ⋅ Heung-Yeung Shum

The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise on weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, which is the first in the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a ``tennis ball'', or for ambiguous drags, making context-aware changes like moving hands into pockets. Moreover, LazyDrag supports multi-round edits with simultaneous move and scale operations. Evaluated on DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by mean distances, VIEScore and user studies. LazyDrag not only sets new state-of-the-art performance, but also paves a new way to editing paradigms. Here is the project website.

Poster

P4-#3111

WILD-Diffusion: A WDRO Inspired Training Method for Diffusion Models under Limited Data

Xianglu Wang ⋅ Wanlin Zhang ⋅ Hu Ding

Diffusion models have recently emerged as a powerful class of generative models and have achieved state-of-the-art performance in various image synthesis tasks. However, training diffusion models generally requires large amounts of data and suffer from overfitting when the dataset size is limited. To address these limitations, we propose a novel method called WILD-Diffusion, which is inspired by Wasserstein Distributionally Robust Optimization (WDRO), an important and elegant mathematical formulation from robust optimization area. Specifically, WILD-Diffusion utilizes WDRO to iteratively generate new training samples within a Wasserstein distance based uncertainty set centered at the limited data data distribution. This carefully designed method can progressively augment the training set throughout the training process and effectively overcome the obstacles caused by the limited data issue. Moreover, we establish the convergence guarantee for our algorithm even though the mixture of diffusion process and WDRO brings significant challenges to our analysis in theory. Finally, we conduct a set of experiments to verify the effectiveness of our proposed method. With WILD-Diffusion, we can achieve more than a $10$% reduction in FID using only $20$% of the training data across different datasets. Moreover, our method can attain state-of-the-art FID with as few as $100$ images, both in pretrained and non-pretrained settings.

Poster

P4-#3110

Controllable Video Generation with Provable Disentanglement

Yifan Shen ⋅ Peiyuan Zhu ⋅ Zijian Li ⋅ Shaoan Xie ⋅ Namrata Deka ⋅ Zongfang Liu ⋅ Zeyu Tang ⋅ Guangyi Chen ⋅ Kun Zhang

Controllable video generation remains a significant challenge, despite recent advances in generating high-quality and consistent videos. Most existing methods for controlling video generation treat the video as a whole, neglecting intricate fine-grained spatiotemporal relationships, which limits both control precision and efficiency. In this paper, we propose \textbf{Co}ntrollable \textbf{V}ide\textbf{o} \textbf{G}enerative \textbf{A}dversarial \textbf{N}etworks (\ourmes) to disentangle the video concepts, thus facilitating efficient and independent control over individual concepts. Specifically, following the \textbf{minimal change principle}, we first disentangle static and dynamic latent variables. We then leverage the \textbf{sufficient change property} to achieve component-wise identifiability of dynamic latent variables, enabling independent control over motion and identity. To establish the theoretical foundation, we provide a rigorous analysis demonstrating the identifiability of our approach. Building on these theoretical insights, we design a \textbf{Temporal Transition Module} to disentangle latent dynamics. To enforce the minimal change principle and sufficient change property, we minimize the dimensionality of latent dynamic variables and impose temporal conditional independence. To validate our approach, we integrate this module as a plug-in for GANs. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that our method significantly improves generation quality and controllability across diverse real-world scenarios.

Poster

P4-#3109

Unified 3D Scene Understanding Through Physical World Modeling

Wanhee Lee ⋅ Klemen Kotar ⋅ Rahul Venkatesh ⋅ Jared Watrous ⋅ Honglin Chen ⋅ Khai Loong Aw ⋅ Daniel Yamins

Understanding 3D scenes requires flexible combinations of visual reasoning tasks, including depth estimation, novel view synthesis, and object manipulation, all of which are essential for perception and interaction. Existing approaches have typically addressed these tasks in isolation, preventing them from sharing a common representation or transferring knowledge across tasks. A conceptually simpler but practically non-trivial alternative is to unify these diverse tasks into a single model, reducing different tasks from separate training objectives to merely different prompts and allowing for joint training across all datasets. In this work, we present a physical world model for unified 3D understanding and interaction 3WM, formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse tasks emerge from different inference pathways through the graph: novel view synthesis from RGB and dense flow prompts, object manipulation from RGB and sparse flow prompts, and depth estimation from RGB and camera conditioning, all zero-shot without task-specific training. 3WM outperforms specialized baselines without the need for finetuning by offering precise controllability, strong geometric consistency, and robustness in real-world scenarios, achieving state-of-the-art performance on NVS and 3D object manipulation. Beyond predefined tasks, the model supports composable inference pathways, such as moving objects aside while navigating a 3D environment, enabling complex geometric reasoning. This demonstrates that a unified model can serve as a practical alternative to fragmented task-specific systems, taking a step towards a general-purpose visual world model.

Poster

P4-#3108

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

Han Song ⋅ Yucheng Zhou ⋅ Jianbing Shen ⋅ Yu Cheng

Combining Chain-of-Thought (CoT) with Reinforcement Learning (RL) improves text-to-image (T2I) generation, yet the underlying interaction between CoT's exploration and RL's optimization remains unclear. We present a systematic entropy-based analysis that yields three key insights: (1) CoT expands the generative exploration space, while RL contracts it toward high-reward regions; (2) final reward is strongly negatively correlated with both the mean and variance of image-token entropy, highlighting the need to reduce uncertainty and instability; and (3) the entropy of the textual CoT directly governs downstream image quality, with lower-entropy CoTs leading to better generations. Motivated by these findings, we propose Entropy-Guided Group Relative Policy Optimization (EG-GRPO), a fine-tuning strategy that reallocates optimization budget by uncertainty: low-entropy tokens are excluded from reward-driven updates to preserve stability, while high-entropy tokens receive an entropy bonus that encourages structured exploration without collapse. Experiments on standard T2I benchmarks demonstrate that EG-GRPO achieves state-of-the-art performance.

Poster

P4-#3107

Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer

Zixin Yin ⋅ Xili Dai ⋅ Ling-Hao Chen ⋅ Deyu Zhou ⋅ Jianan Wang ⋅ Duomin Wang ⋅ Gang Yu ⋅ Lionel Ni ⋅ Lei Zhang ⋅ Heung-Yeung Shum

Text-guided color editing in images and videos is a fundamental yet unsolved problem, requiring fine-grained manipulation of color attributes, including albedo, light source color, and ambient lighting, while preserving physical consistency in geometry, material properties, and light-matter interactions. Existing training-free approaches provide broad applicability across editing tasks but struggle with precise color control and often introduce visual inconsistency in both edited and non-edited regions. In this work, we present ColorCtrl, a training-free color editing method that leverages the attention mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing, along with word-level control of attribute intensity. Our method modifies only the intended regions specified by the prompt, leaving unrelated areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate that ColorCtrl outperforms existing training-free approaches and achieves state-of-the-art performances in both edit quality and consistency. Furthermore, our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency. When extended to video models like CogVideoX, our approach exhibits greater advantages, particularly in maintaining temporal coherence and editing stability. Finally, our method generalizes to instruction-based editing diffusion models such as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility. Here is the website.

Poster

P4-#3106

PI-Light: Physics-Inspired Diffusion for Full-Image Relighting

Zhexin Liang ⋅ Zhaoxi Chen ⋅ Yongwei Chen ⋅ Tianyi Wei ⋅ Tengfei Wang ⋅ Xingang Pan

Full-image relighting remains a challenging problem due to the difficulty of collecting large-scale structured paired data, the difficulty of maintaining physical plausibility, and the limited generalizability imposed by data-driven priors. Existing attempts to bridge the synthetic-to-real gap for full-scene relighting remain suboptimal. To tackle these challenges, we introduce **P**hysics-**I**nspired diffusion for full-image re**Light** ($\pi$-Light, or PI-Light), a two-stage framework that leverages physics-inspired diffusion models. Our design incorporates (i) batch-aware attention, which improves the consistency of intrinsic predictions across a collection of images, (ii) a physics-guided neural rendering module that enforces physically plausible light transport, (iii) physics-inspired losses that regularize training dynamics toward a physically meaningful landscape, thereby enhancing generalizability to real-world image editing, and (iv) a carefully curated dataset of diverse objects and scenes captured under controlled lighting conditions. Together, these components enable efficient finetuning of pretrained diffusion models while also providing a solid benchmark for downstream evaluation. Experiments demonstrate that $\pi$-Light synthesizes specular highlights and diffuse reflections across a wide variety of materials, achieving superior generalization to real-world scenes compared with prior approaches.

Poster

P4-#3105

Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Le Zhuo ⋅ Songhao Han ⋅ Yuandong Pu ⋅ Boxiang Qiu ⋅ Sayak Paul ⋅ Yue Liao ⋅ Yihao Liu ⋅ Jie Shao ⋅ Xi Chen ⋅ Si Liu ⋅ Hongsheng Li

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning, text rendering, and multimodal reasoning for factual fidelity. To address this, we present the first comprehensive, systematic investigation of this domain, encompassing data construction, model training, and an evaluation benchmark. First, we construct a large-scale dataset of 1.3 million high-quality structured image pairs derived from executable drawing programs and augmented with chain-of-thought reasoning annotations. Building on it, we train a unified model that integrates a VLM with FLUX.1 Kontext via a lightweight connector for enhanced multimodal understanding. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation, further boosted by an external reasoner at inference time. Finally, we introduce StructBench, a novel benchmark for generation and editing with over 1,700 challenging instances, and an accompanying evaluation metric, StructScore, which employs a multi-round Q&A protocol to assess fine-grained factual accuracy. Evaluations of 15 models reveal that even leading closed-source systems remain far from satisfactory. Our model attains strong editing performance, and inference-time reasoning yields consistent gains across diverse architectures. By releasing the dataset, model, and benchmark, we aim to advance unified multimodal foundations for structured visuals.

Poster

P4-#3104

RegionE: Adaptive Region-Aware Generation for Efficient Image Editing

Pengtao Chen ⋅ Xianfang Zeng ⋅ Maosen Zhao ⋅ Mingzhu Shen ⋅ Peng Ye ⋅ Bangyin Xiang ⋅ Zhibo Wang ⋅ Wei Cheng ⋅ Gang Yu ⋅ Tao Chen

Recently, instruction-based image editing (IIE) has received widespread attention. In practice, IIE often modifies only specific regions of an image, while the remaining areas largely remain unchanged. Although these two types of regions differ significantly in generation difficulty and computational redundancy, existing IIE models do not account for this distinction, instead applying a uniform generation process across the entire image. This motivates us to propose \textbf{RegionE}, an adaptive, region-aware generation framework that accelerates IIE tasks without additional training. Specifically, the RegionE framework consists of three main components: 1) Adaptive Region Partition. We observed that the trajectory of unedited regions is straight, allowing for multi-step denoised predictions to be inferred in a single step. Therefore, in the early denoising stages, we partition the image into edited and unedited regions based on the difference between the final estimated result and the reference image. 2) Region-Aware Generation. After distinguishing the regions, we replace multi-step denoising with one-step prediction for unedited areas. For edited regions, the trajectory is curved, requiring local iterative denoising. To improve the efficiency and quality of local iterative generation, we propose the Region-Instruction KV Cache, which reduces computational cost while incorporating global information. 3) Adaptive Velocity Decay Cache. Observing that adjacent timesteps in edited regions exhibit strong velocity similarity, we further propose an adaptive velocity decay cache to accelerate the local denoising process. We applied RegionE to state-of-the-art IIE base models, including Step1X-Edit, FLUX.1 Kontext, and Qwen-Image-Edit. RegionE achieved acceleration factors of 2.57×, 2.41×, and 2.06×, respectively, with minimal quality loss (PSNR: 30.520–32.133). Evaluations by GPT-4o also confirmed that semantic and perceptual fidelity were well preserved.

Poster

P4-#3103

Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening

Wooseok Jeon ⋅ Seunghyun Shin ⋅ Dongmin Shin ⋅ Hae-Gon Jeon

Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesirable visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method can deliberately avoid denoising the end-conditioned path which causes the ambiguity of the path, and yield more temporally coherent inbetweening results with the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.

Poster

P4-#3102

SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model

Dongli Xu ⋅ Aleksei Tiulpin ⋅ Matthew Blaschko

Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional–unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256 × 256 among autoregressive models.

Poster

P4-#3101

TPDiff: Temporal Pyramid Video Diffusion Model

Lingmin Ran ⋅ Mike Zheng Shou

The development of video diffusion models unveils a significant challenge: the substantial computational demands. To mitigate this challenge, we note that the reverse process of diffusion exhibits an inherent entropy-reducing nature. Given the inter-frame redundancy in video modality, maintaining full frame rates in high-entropy stages is unnecessary. Based on this insight, we propose TPDiff, a unified framework to enhance training and inference efficiency. By dividing diffusion into several stages, our framework progressively increases frame rate along the diffusion process with only the last stage operating on full frame rate, thereby optimizing computational efficiency. To train the multi-stage diffusion model, we introduce a dedicated training framework: stage-wise diffusion. By solving the partitioned probability flow ordinary differential equations (ODE) of diffusion under aligned data and noise, our training strategy is applicable to various diffusion forms and further enhances training efficiency. Comprehensive experimental evaluations validate the generality of our method, demonstrating 50% reduction in training cost and 1.5x improvement in inference efficiency.

Poster

P4-#3201

FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation

Xinya Ji ⋅ Sebastian Weiss ⋅ Manuel Kansy ⋅ Jacek Naruniec ⋅ Xun Cao ⋅ Barbara Solenthaler ⋅ Derek Bradley

Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose FastGHA, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.

Poster

P4-#3202

PreciseCache: Precise Feature Caching for Efficient and High-fidelity Video Generation

Jiangshan Wang ⋅ Kang Zhao ⋅ Jiayi Guo ⋅ Jiayu Wang ⋅ Hang Guo ⋅ Chenyang Zhu ⋅ Xiu Li ⋅ Xiangyu Yue

High computational costs and slow inference hinder the practical application of video generation models. While prior works accelerate the generation process through feature caching, they often suffer from notable quality degradation. In this work, we reveal that this issue arises from their inability to distinguish truly redundant features, which leads to the unintended skipping of computations on important features. To address this, we propose \textbf{PreciseCache}, a plug-and-play framework that precisely detects and skips truly redundant computations, thereby accelerating inference without sacrificing quality. Specifically, PreciseCache contains two components: LFCache for step-wise caching and BlockCache for block-wise caching. For LFCache, we compute the Low-Frequency Difference (LFD) between the prediction features of the current step and those from the previous cached step. Empirically, we observe that LFD serves as an effective measure of step-wise redundancy, accurately detecting highly redundant steps whose computation can be skipped through reusing cached features. To further accelerate generation within each non-skipped step, we propose BlockCache, which precisely detects and skips redundant computations at the block level within the network. Extensive experiments on various backbones demonstrate the effectiveness of our PreciseCache, which achieves an average of $2.6\times$ speedup without noticeable quality loss. Source code will be released.

Poster

P4-#3203

Test-Time Iterative Error Correction for Efficient Diffusion Models

Yunshan Zhong ⋅ Weiqi Yan ⋅ Yuxin Zhang

With the growing demand for high-quality image generation on resource-constrained devices, efficient diffusion models have received increasing attention. However, such models suffer from approximation errors introduced by efficiency techniques, which significantly degrade generation quality. Once deployed, these errors are difficult to correct, as modifying the model is typically infeasible in deployment environments. Through an analysis of error propagation across diffusion timesteps, we reveal that these approximation errors can accumulate exponentially, severely impairing output quality. Motivated by this insight, we propose Iterative Error Correction (IEC), a novel test-time method that mitigates inference-time errors by iteratively refining the model’s output. IEC is theoretically proven to reduce error propagation from exponential to linear growth, without requiring any retraining or architectural changes. IEC can seamlessly integrate into the inference process of existing diffusion models, enabling a flexible trade-off between performance and efficiency. Extensive experiments show that IEC consistently improves generation quality across various datasets, efficiency techniques, and model architectures, establishing it as a practical and generalizable solution for test-time enhancement of efficient diffusion models. The code is available at https://github.com/zysxmu/IEC.

Poster

P4-#3204

CoDi: Subject-Consistent and Pose-Diverse Text-to-Image Generation

Zhanxin Gao ⋅ Beier Zhu ⋅ Liangyao ⋅ Jian Yang ⋅ Ying Tai

Subject-consistent generation (SCG)-aiming to maintain a consistent subject identity across diverse scenes-remains a challenge for text-to-image (T2I) models. Existing training-free SCG methods often achieve consistency at the cost of layout and pose diversity, hindering expressive visual storytelling. To address the limitation, we propose subject-Consistent and pose-Diverse T2I framework, dubbed as CoDi, that enables consistent subject generation with diverse pose and layout. Motivated by the progressive nature of diffusion, where coarse structures emerge early and fine details are refined later, CoDi adopts a two-stage strategy: Identity Transport (IT) and Identity Refinement (IR). IT operates in the early denoising steps, using optimal transport to transfer identity features to each target image in a pose-aware manner. This promotes subject consistency while preserving pose diversity. IR is applied in the later denoising steps, selecting the most salient identity features to further refine subject details. Extensive qualitative and quantitative results on subject consistency, pose diversity, and prompt fidelity demonstrate that CoDi achieves both better visual perception and stronger performance across all metrics. The code is provided in the supplementary material.

Poster

P4-#3206

Instilling an Active Mind in Avatars via Cognitive Simulation

Jianwen Jiang ⋅ Weihong Zeng ⋅ Zerong Zheng ⋅ Jiaqi Yang ⋅ Chao Liang ⋅ Wang Liao ⋅ Han Liang ⋅ Weifeng Chen ⋅ XING WANG ⋅ Yuan Zhang ⋅ Mingyuan Gao

Current video avatar models can generate fluid animations but struggle to capture a character's authentic essence, primarily synchronizing motion with low-level audio cues instead of understanding higher-level semantics like emotion or intent. To bridge this gap, we propose a novel framework for generating character animations that are not only physically plausible but also semantically rich and expressive. Our model is built on two technical innovations. First, we employ Multimodal Large Language Models to generate a structured textual representation from input conditions, providing high-level semantic guidance for creating contextually and emotionally resonant actions. Second, to ensure robust fusion of multimodal signals, we introduce a specialized Multimodal Diffusion Transformer architecture featuring a novel Pseudo Last Frame design. This allows our model to accurately interpret the joint semantics of audio, images and text, generating motions that are deeply coherent with the overall context. Comprehensive experiments validate the superiority of our method, which achieves compelling results in lip-sync accuracy, video quality, motion naturalness, and semantic consistency. The approach also shows strong generalization to challenging scenarios, including multi-person and non-human subjects. Our video results are linked in https://omnihuman-lab.github.io/v1_5/ .

Poster

P4-#3207

Latent Wavelet Diffusion For Ultra High-Resolution Image Synthesis

Luigi Sigillo ⋅ Shengfeng He ⋅ Danilo Comminiello

High-resolution image synthesis remains a core challenge in generative modeling, particularly in balancing computational efficiency with the preservation of fine-grained visual detail. We present $\textit{Latent Wavelet Diffusion (LWD)}$, a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space. This is complemented by a scale-consistent VAE objective to ensure high spectral fidelity. The primary advantage of our approach is its efficiency: LWD requires no architectural modifications and adds zero additional cost during inference, making it a practical solution for scaling existing models. Across multiple strong baselines, LWD consistently improves perceptual quality and FID scores, demonstrating the power of signal-driven supervision as a principled and efficient path toward high-resolution generative modeling.

Poster

P4-#3208

One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs

Yuanzhi Zhu ⋅ Ruiqing Wang ⋅ Shilin Lu ⋅ Hanshu YAN ⋅ Junnan Li ⋅ Kai Zhang

Recent advances in diffusion and flow-based generative models have demonstrated remarkable success in image restoration tasks, achieving superior perceptual quality compared to traditional deep learning approaches. However, these methods either require numerous sampling steps to generate high-quality images, resulting in significant computational overhead, or rely on common model distillation, which usually imposes a fixed fidelity-realism trade-off and thus lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism. Our approach first trains a conditional flow-based super-resolution model to serve as a teacher model. We then distill this teacher model by applying a specialized constraint. Specifically, we force the predictions from our one-step student model for same input to lie on the same sampling ODE trajectory of the teacher model. This alignment ensures that the student model's single-step predictions from initial states match the teacher's predictions from a closer intermediate state. Through extensive experiments on datasets including FFHQ (256$\times$256), DIV2K, and ImageNet (256$\times$256), we demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off. Code and pre-trained models are available at \href{https://github.com/yuanzhi-zhu/OFTSR}{https://github.com/yuanzhi-zhu/OFTSR} and \href{https://huggingface.co/Yuanzhi/OFTSR}{https://huggingface.co/Yuanzhi/OFTSR}, respectively.

Poster

P4-#3209

Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

Zichen Geng ⋅ Zeeshan Hayder ⋅ Bo Miao ⋅ Jian Liu ⋅ Wei Liu ⋅ Ajmal Mian

Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.

Poster

P4-#3210

Inverse Virtual Try-On: Generating Multi-Category Product-Style Images from Clothed Individuals

Davide Lobba ⋅ Fulvio Sanguigni ⋅ Bin Ren ⋅ Marcella Cornia ⋅ Rita Cucchiara ⋅ Nicu Sebe

Virtual try-on (VTON) has been widely explored for rendering garments onto person images, while its inverse task, virtual try-off (VTOFF), remains largely overlooked. VTOFF aims to recover standardized product images of garments directly from photos of clothed individuals. This capability is of great practical importance for e-commerce platforms, large-scale dataset curation, and the training of foundation models. Unlike VTON, which must handle diverse poses and styles, VTOFF naturally benefits from a consistent output format in the form of flat garment images. However, existing methods face two major limitations: (i) exclusive reliance on visual cues from a single photo often leads to ambiguity, and (ii) generated images usually suffer from loss of fine details, limiting their real-world applicability. To address these challenges, we introduce TEMU-VTOFF, a Text-Enhanced MUlti-category framework for VTOFF. Our architecture is built on a dual DiT-based backbone equipped with a multimodal attention mechanism that jointly exploits image, text, and mask information to resolve visual ambiguities and enable robust feature learning across garment categories. To explicitly mitigate detail degradation, we further design an alignment module that refines garment structures and textures, ensuring high-quality outputs. Extensive experiments on VITON-HD and Dress Code show that TEMU-VTOFF achieves new state-of-the-art performance, substantially improving both visual realism and consistency with target garments. Code and models are available at: https://temu-vtoff-page.github.io/.

Poster

P4-#3211

TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Zheng Ding ⋅ Weirui Ye

Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce **TreeGRPO**, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) *High sample efficiency*, achieving better performance under same training samples (2) *Fine-grained credit assignment* via reward backpropagation that computes step-specific advantages, overcoming the uniform credit assignment limitation of trajectory-based methods, and (3) *Amortized computation* where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves **2.4$\times$** faster training} while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment. The project website is available at https://treegrpo.github.io.

Poster

P4-#3212

CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow-Map Models

Zheyuan Hu ⋅ Chieh-Hsin Lai ⋅ Yuki Mitsufuji ⋅ Stefano Ermon

Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce *mid-training*, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, *Consistency Mid-Training* (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state-of-the-art two-step FIDs of 1.97 (CIFAR-10), 1.32 (ImageNet $64\times64$), and 1.84 (ImageNet $512\times512$), using up to $98$\% less training data and GPU time than CMs. On ImageNet $256\times256$, it attains 1-step FID 3.34 with $\sim50$\% less training than MF from scratch (FID 3.43). On MSCOCO T2I, CMT reaches the best FID with $\sim47$\% less training. This establishes CMT as a principled, efficient, and general framework for training flow map models. Code and models are available at https://github.com/sony/cmt.

Poster

P4-#3213

Charts Are Not Images: On the Challenges of Scientific Chart Editing

Li Li ⋅ Ryan Rossi ⋅ Sungchul Kim ⋅ Sunav Choudhary ⋅ Franck Dernoncourt ⋅ Puneet Mathur ⋅ Zhengzhong Tu ⋅ Yue Zhao

Generative models, such as diffusion and autoregressive approaches, have demonstrated impressive capabilities in editing natural images. However, applying these tools to scientific charts rests on a flawed assumption: a chart is not merely an arrangement of pixels but a visual representation of structured data governed by a graphical grammar. Consequently, chart editing is not a pixel-manipulation task but a structured transformation problem. To address this fundamental mismatch, we introduce \textit{FigEdit}, a large-scale benchmark for scientific figure editing comprising over 30,000 samples. Grounded in real-world data, our benchmark is distinguished by its diversity, covering 10 distinct chart types and a rich vocabulary of complex editing instructions. The benchmark is organized into five distinct and progressively challenging tasks: single edits, multi edits, conversational edits, visual-guidance-based edits, and style transfer. Our evaluation of a range of state-of-the-art models on this benchmark reveals their poor performance on scientific figures, as they consistently fail to handle the underlying structured transformations required for valid edits. Furthermore, our analysis indicates that traditional evaluation metrics (e.g., SSIM, PSNR) have limitations in capturing the semantic correctness of chart edits. Our benchmark demonstrates the profound limitations of pixel-level manipulation and provides a robust foundation for developing and evaluating future structure-aware models. By releasing \textit{FigEdit}, we aim to enable systematic progress in structure-aware figure editing, provide a common ground for fair comparison, and encourage future research on models that understand both the visual and semantic layers of scientific charts.

Poster

P4-#3214

DSA: Efficient Inference For Video Generation Models via Distributed Sparse Attention

Shenggui Li ⋅ Runyu Lu ⋅ qiaoling chen ⋅ Haiyan Yin ⋅ YUEMING LYU ⋅ Yonggang Wen ⋅ Ivor Tsang ⋅ Tianwei Zhang

Diffusion Transformer models have driven the rapid advances in video generation, achieving state-of-the-art quality and flexibility. However, their attention mechanism remains a major performance bottleneck, as its dense computation scales quadratically with the sequence length. To overcome this limitation and reduce the generation latency, we propose DSA, a novel attention mechanism that integrates sparse attention with distributed inference for diffusion-based video generation. By leveraging carefully-designed parallelism strategies and scheduling, DSA significantly reduces redundant computation while preserving global context. Extensive experiments on benchmark datasets demonstrate that, when deployed on 8 GPUs, DSA achieves up to 1.43× inference speedup than the existing distributed method and 10.79× faster than single-GPU inference.

Poster

P4-#3215

Geometric Image Editing via Effects-Sensitive In-Context Inpainting with Diffusion Transformers

Shuo Zhang ⋅ Wenzhuo Wu ⋅ Huayu Zhang ⋅ Cheng Jiarong ⋅ Xianghao Zang ⋅ Chao Ban ⋅ Hao Sun ⋅ Zhongjiang He ⋅ Tianwei Cao ⋅ Kongming Liang ⋅ Zhanyu Ma

Recent advances in diffusion models have significantly improved image editing. However, challenges persist in handling geometric transformations, such as translation, rotation, and scaling, particularly in complex scenes. Existing approaches suffer from two main limitations: (1) difficulty in achieving accurate geometric editing of object translation, rotation, and scaling; (2) inadequate modeling of intricate lighting and shadow effects, leading to unrealistic results. To address these issues, we propose GeoEdit, a framework that leverages in-context generation through a diffusion transformer module, which integrates geometric transformations for precise object edits. Moreover, we introduce Effects-Sensitive Attention, which enhances the modeling of intricate lighting and shadow effects for improved realism. To further support training, we construct RS-Objects, a large-scale geometric editing dataset containing over 120,000 high-quality image pairs, enabling the model to learn precise geometric editing while generating realistic lighting and shadows. Extensive experiments on public benchmarks demonstrate that GeoEdit consistently outperforms state-of-the-art methods in terms of visual quality, geometric accuracy, and realism.

Poster

P4-#3217

Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Min-Seop Kwak ⋅ Junho Kim ⋅ Sangdoo Yun ⋅ Dongyoon Han ⋅ Taekyung Kim ⋅ Seungryong Kim ⋅ Jin-Hwa Kim

We introduce a diffusion-based framework that generates aligned novel view images and geometries via a warping‐and‐inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models limited to in‐domain views, our method leverages off‐the‐shelf geometry predictors to predict partial geometries viewed from reference images, and formulates novel view synthesis as an inpainting task for both image and geometry. To ensure accurate alignment between the generated image and geometry, we propose cross-modal attention instillation where the attention maps from an image diffusion branch are injected into a parallel geometry diffusion branch during both training and inference. This multi-task approach achieves synergistic effects, facilitating both geometrically robust image synthesis and geometry prediction. We further introduce proximity‐based mesh conditioning to integrate depth and normal cues, interpolating between point cloud and filtering erroneously predicted geometry from influencing the generation process. Empirically, our method achieves high-fidelity extrapolative view synthesis, delivers competitive reconstruction under interpolation settings, and produces geometrically aligned point clouds as 3D completion.

Poster

P4-#3218

SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms

Haithem Turki ⋅ Qi Wu ⋅ Xin Kang ⋅ Janick Martinez Esturo ⋅ Shengyu Huang ⋅ Ruilong Li ⋅ Zan Gojcic ⋅ Riccardo de Lutio

Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20$\times$ faster than ray tracing approaches and 1.5-14$\times$ faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.

Poster

P4-#3308

STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Yushi LAN ⋅ Yihang Luo ⋅ Fangzhou Hong ⋅ Shangchen Zhou ⋅ Honghua Chen ⋅ Zhaoyang Lyu ⋅ Bo DAI ⋅ Shuai Yang ⋅ Chen Change Loy ⋅ Xingang Pan

We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure, enabling efficient large-scale pretraining and fine-tuning for various downstream 3D tasks. Our results underscore the potential of causal Transformer models for online 3D perception, paving the way for real-time 3D understanding in streaming environments.

Poster

P4-#3318

QuadGPT: Native Quadrilateral Mesh Generation with Autoregressive Models

Jian Liu ⋅ Chunshi Wang ⋅ Song Guo ⋅ Haohan Weng ⋅ Zhen Zhou ⋅ Zhiqi Li ⋅ Jiaao Yu ⋅ Yiling Zhu ⋅ Jing Xu ⋅ Biwen Lei ⋅ Zhuo Chen ⋅ Chunchao Guo

The generation of quadrilateral-dominant meshes is a cornerstone of professional 3D content creation. However, existing generative models generate quad meshes by first generating triangle meshes and then merging triangles into quadrilaterals with some specific rules, which typically produces quad meshes with poor topology. In this paper, we introduce QuadGPT, the first autoregressive framework for generating quadrilateral meshes in an end-to-end manner. QuadGPT formulates this as a sequence prediction paradigm, distinguished by two key innovations: a unified tokenization method to handle mixed topologies of triangles and quadrilaterals, and a specialized Reinforcement Learning fine-tuning method tDPO for better generation quality. Extensive experiments demonstrate that QuadGPT significantly surpasses previous triangle-to-quad conversion pipelines in both geometric accuracy and topological quality. Our work establishes a new benchmark for native quad-mesh generation and showcases the power of combining large-scale autoregressive models with topology-aware RL refinement for creating structured 3D assets.

Poster

P4-#3317

From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting

Jianing Chen ⋅ Zehao Li ⋅ Yujun Cai ⋅ Hao Jiang ⋅ Shuqin Gao ⋅ Honglong Zhao ⋅ Tianlu Mao ⋅ Yucheng Zhang

Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity inferring 3D motion from limited views and computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.

Poster

P4-#3316

Scaling Sequence-to-Sequence Generative Neural Rendering

Shikun Liu ⋅ Kam Woh Ng ⋅ Wonbong Jang ⋅ Jiadong Guo ⋅ Junlin Han ⋅ Haozhe Liu ⋅ Yiannis Douratsos ⋅ Juan Perez ⋅ Zijian Zhou ⋅ Khanh Chi Phung ⋅ Tao Xiang ⋅ Juan-Manuel Perez-Rua

We present Kaleido, a family of generative models designed for photorealistic, unified object- and scene-level neural rendering. Kaleido is driven by the principle of treating 3D as a specialised sub-domain of video, which we formulate purely as a sequence-to-sequence image synthesis task. Through a systemic study of scaling sequence-to-sequence generative neural rendering, we introduce key architectural innovations that enable our model to: i) perform generative view synthesis without explicit 3D representations; ii) generate any number of 6-DoF target views conditioned on any number of reference views via a masked autoregressive framework; and iii) seamlessly unify 3D and video modelling within a single decoder-only rectified flow transformer. Within this unified framework, Kaleido leverages large-scale video data for pre-training, which significantly improves spatial consistency and reduces reliance on scarce, camera-labelled 3D datasets --- all without any architectural modifications. Kaleido sets a new state-of-the-art on a range of view synthesis benchmarks. Its zero-shot performance substantially outperforms other generative methods in few-view settings, and, for the first time, matches the quality of per-scene optimisation methods in many-view settings. For supplementary materials, including Kaleido's generated renderings and videos, please refer to our website: https://shikun.io/projects/kaleido.

Poster

P4-#3315

Quantized Visual Geometry Grounded Transformer

Weilun Feng ⋅ Haotong Qin ⋅ Mingqiang Wu ⋅ Chuanguang Yang ⋅ Yuqi Li ⋅ Xiangqi Li ⋅ Zhulin An ⋅ Libo Huang ⋅ Yulun Zhang ⋅ Michele Magno ⋅ Yongjun Xu

Learning-based 3D reconstruction models, represented by Visual Geometry Grounded Transformers (VGGTs), have achieved remarkable progress with large-scale transformers. Their prohibitive computational and memory costs severely hinder real-world deployment. Post-Training Quantization (PTQ) has emerged as a common practice to compress and accelerate models. However, we empirically observe that PTQ faces unique obstacles when compressing billion-scale VGGTs: the data-independent special tokens induce heavy-tailed activation distributions, while the multi-view nature of 3D data makes calibration sample selection highly unstable. This paper proposes the first **Quant**ization framework for **VGGT**s, namely **QuantVGGT**. This mainly relies on two technical contributions: First, we introduce *Dual-Smoothed Fine-Grained Quantization*, which integrates pre-global Hadamard rotation and post-local channel smoothing to robustly mitigate heavy-tailed distributions and inter-channel variance. Second, we design *Noise-Filtered Diverse Sampling*, which filters outliers via deep-layer statistics and constructs frame-aware diverse calibration clusters to ensure stable quantization ranges. Comprehensive experiments demonstrate that QuantVGGT achieves the state-of-the-art results across different benchmarks and bit-width, surpassing the previous state-of-the-art generic quantization method with a great margin. We highlight that our 4-bit QuantVGGT can deliver a **3.7$\times$** memory reduction and **2.5$\times$** acceleration in real-hardware inference, while preserving over **98\%** reconstruction accuracy of the full-precision counterparts. This demonstrates the vast advantages and practicality of QuantVGGT in resource-constrained scenarios.

Poster

P4-#3314

ODE-GS: Latent ODEs for Dynamic Scene Extrapolation with 3D Gaussian Splatting

Daniel Wang ⋅ Patrick Rim ⋅ Tian Tian ⋅ Dong Lao ⋅ Alex Wong ⋅ Ganesh Sundaramoorthi

We introduce ODE-GS, a novel approach that integrates 3D Gaussian Splatting with latent neural ordinary differential equations (ODEs) to enable future extrapolation of dynamic 3D scenes. Unlike existing dynamic scene reconstruction methods, which rely on time-conditioned deformation networks and are limited to interpolation within a fixed time window, ODE-GS eliminates timestamp dependency by modeling Gaussian parameter trajectories as continuous-time latent dynamics. Our approach first learns an interpolation model to generate accurate Gaussian trajectories within the observed window, then trains a Transformer encoder to aggregate past trajectories into a latent state evolved via a neural ODE. Finally, numerical integration produces smooth, physically plausible future Gaussian trajectories, enabling rendering at arbitrary future timestamps. On the D-NeRF, NVFi, and HyperNeRF benchmarks, ODE-GS achieves state-of-the-art extrapolation performance, improving metrics by 19.8% compared to leading baselines, demonstrating its ability to accurately represent and predict 3D scene dynamics.

Poster

P4-#3313

Light of Normals: Unified Feature Representation for Universal Photometric Stereo

Houyuan Chen ⋅ Hong Li ⋅ Chongjie Ye ⋅ Zhaoxi Chen ⋅ Bohan Li ⋅ Shaocong Xu ⋅ xianda guo ⋅ Xuhui Liu ⋅ Yikai Wang ⋅ Baochang Zhang ⋅ Satoshi Ikehata ⋅ Boxin Shi ⋅ Anyi Rao ⋅ Hao Zhao

Universal photometric stereo (PS) is defined by two factors: it must (i) operate under arbitrary, unknown lighting conditions and (ii) avoid reliance on specific illumination models. Despite progress (e.g., SDM UniPS), two challenges remain. First, current encoders cannot guarantee that illumination and normal information are decoupled. To enforce decoupling, we introduce LINO UniPS with two key components: (i) Light Register Tokens with light alignment supervision to aggregate point, direction, and environment lights; (ii) Interleaved Attention Block featuring global cross-image attention that takes all lighting conditions together so the encoder can factor out lighting while retaining normal-related evidence. Second, high-frequency geometric details are easily lost. We address this with (i) a Wavelet-based Dual-branch Architecture and (ii) a Normal-gradient Perception Loss. These techniques yield a \textbf{unified} feature space in which lighting is explicitly represented by register tokens, while normal details are preserved via wavelet branch. We further introduce PS-Verse, a large-scale synthetic dataset graded by geometric complexity and lighting diversity, and adopt curriculum training from simple to complex scenes. Extensive experiments show new state-of-the-art results on public benchmarks (e.g., DiLiGenT, Luces), stronger generalization to real materials, and improved efficiency; ablations confirm that Light Register Tokens + Interleaved Attention Block drive better feature decoupling, while Wavelet-based Dual-branch Architecture + Normal-gradient Perception Loss recover finer details.

Poster

P4-#3312

A^2TG: Adaptive Anisotropic Textured Gaussians for Efficient 3D Scene Representation

Sheng-Chi Hsu ⋅ Ting-Yu Yen ⋅ Shih-Hsuan Hung ⋅ Hung-Kuo Chu

Gaussian Splatting has emerged as a powerful representation for high-quality, real-time 3D scene rendering. While recent works extend Gaussians with learnable textures to enrich visual appearance, existing approaches allocate a fixed square texture per primitive, leading to inefficient memory usage and limited adaptability to scene variability. In this paper, we introduce adaptive anisotropic textured Gaussians (A$^2$TG), a novel representation that generalizes textured Gaussians by equipping each primitive with an anisotropic texture. Our method employs a gradient-guided adaptive rule to jointly determine texture resolution and aspect ratio, enabling non-uniform, detail-aware allocation that aligns with the anisotropic nature of Gaussian splats. This design significantly improves texture efficiency, reducing memory consumption while enhancing image quality. Experiments on multiple benchmark datasets demonstrate that A$^2$TG consistently outperforms fixed-texture Gaussian Splatting methods, achieving comparable rendering fidelity with substantially lower memory requirements.

Poster

P4-#3311

Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image

Ming Qian ⋅ Zimin Xia ⋅ Changkun Liu ⋅ Shuailei Ma ⋅ Wen Wang ⋅ Zeran Ke ⋅ Bin Tan ⋅ Hang Zhang ⋅ Gui-Song Xia

Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. {\revisioncolor For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m.} Crucially, this geometric leap also boosts photorealism, reducing the Fr\'echet Inception Distance (FID) from $\sim$40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code will be released on https://github.com/qianmingduowan/Sat3DGen.

Poster

P4-#3309

IncVGGT: Incremental VGGT for Memory-Bounded Long-Range 3D Reconstruction

Keyu Fang ⋅ Changchun Zhou ⋅ Yuzhe Fu ⋅ Hai Li ⋅ Yiran Chen

We present IncVGGT, a training-free incremental variant of VGGT that makes transformer-based 3D reconstruction feasible for long sequences in real-world applications. Vanilla VGGT relies on dense global attention, which causes memory to grow quadratically and requires excessive computation, making it impractical for long-sequence scenarios. Even evolved streaming variants, such as StreamVGGT, still suffer from rapidly growing cache and latency. IncVGGT addresses these challenges from two orthogonal directions: (1) register and fuse overlapping frames into composite views, reducing duplicate tokens, and (2) history-side pruning retains only the top-$k$ most relevant/maximum slots together with the most recent one, bounding cache growth. This incremental and memory-efficient design minimizes computation and memory occupation across arbitrarily long sequences. Compared to StreamVGGT, IncVGGT sustains arbitrarily long sequences with large efficiency gains (e.g., on 500-frame sequences, 58.5$\times$ fewer operators, 9$\times$ lower memory, 25.7$\times$ less energy, and 4.9$\times$ faster inference) while maintaining comparable accuracy. More importantly, unlike existing baselines that directly run out of memory beyond 300 (VGGT)–500 (StreamVGGT) frames, IncVGGT continues to operate smoothly even on 10k-frame inputs under an 80GB GPU, showing that our design truly scales to ultra-long sequences without hitting memory limits. These results highlight IncVGGT’s potential for deployment in resource-constrained edge devices for long-range 3D scenarios.

Poster

P4-#3012

3DGEER: 3D Gaussian Rendering Made Exact and Efficient for Generic Cameras

Zixun Huang ⋅ Cho-Ying Wu ⋅ Yuliang Guo ⋅ Xinyu Huang ⋅ Liu Ren

3D Gaussian Splatting (3DGS) achieves an appealing balance between rendering quality and efficiency, but relies on approximating 3D Gaussians as 2D projections—an assumption that degrades accuracy, especially under generic large field-of-view (FoV) cameras. Despite recent extensions, no prior work has simultaneously achieved both projective exactness and real-time efficiency for general cameras. We introduce 3DGEER, a geometrically exact and efficient Gaussian rendering framework. From first principles, we derive a closed-form expression for integrating Gaussian density along a ray, enabling precise forward rendering and differentiable optimization under arbitrary camera models. To retain efficiency, we propose the Particle Bounding Frustum (PBF), which provides tight ray–Gaussian association without BVH traversal, and the Bipolar Equiangular Projection (BEAP), which unifies FoV representations, accelerates association, and improves reconstruction quality. Experiments on both pinhole and fisheye datasets show that 3DGEER outperforms prior methods across all metrics, runs 5x faster than existing projective exact ray-based baselines, and generalizes to wider FoVs unseen during training—establishing a new state of the art in real-time radiance field rendering.

Poster

P4-#3307

Streaming Visual Geometry Transformer

Dong Zhuo ⋅ Wenzhao Zheng ⋅ Jiahe Guo ⋅ Yuqi Wu ⋅ Jie Zhou ⋅ Jiwen Lu

Perceiving and reconstructing 3D geometry from videos is a fundamental yet challenging computer vision task. To facilitate interactive and low-latency applications, we propose a streaming visual geometry transformer that shares a similar philosophy with autoregressive large language models. We explore a simple and efficient design and employ a causal transformer architecture to process the input sequence in an online manner. We use temporal causal attention and cache the historical keys and values as implicit memory to enable efficient streaming long-term 3D reconstruction. This design can handle low-latency 3D reconstruction by incrementally integrating historical information while maintaining high-quality spatial consistency. For efficient training, we propose to distill knowledge from the dense bidirectional visual geometry grounded transformer (VGGT) to our causal model. For inference, our model supports the migration of optimized efficient attention operators (e.g., FlashAttention) from large language models. Extensive experiments on various 3D geometry perception benchmarks demonstrate that our model enhances inference speed in online scenarios while maintaining competitive performance, thereby facilitating scalable and interactive 3D vision systems.

Poster

P4-#3306

Dynamic Novel View Synthesis in High Dynamic Range

Kaixuan Zhang ⋅ Zhipeng Xiong ⋅ Minxian Li ⋅ Mingwu Ren ⋅ Jiankang Deng ⋅ Xiatian Zhu

High Dynamic Range Novel View Synthesis (HDR NVS) seeks to learn an HDR 3D model from Low Dynamic Range (LDR) training images captured under conventional imaging conditions. Current methods primarily focus on static scenes, implicitly assuming all scene elements remain stationary and non-living. However, real-world scenarios frequently feature dynamic elements, such as moving objects, varying lighting conditions, and other temporal events, thereby presenting a significantly more challenging scenario. To address this gap, we propose a more realistic problem named HDR Dynamic Novel View Synthesis (HDR DNVS), where the additional dimension ``Dynamic'' emphasizes the necessity of jointly modeling temporal radiance variations alongside sophisticated 3D translation between LDR and HDR. To tackle this complex, intertwined challenge, we introduce HDR-4DGS, a Gaussian Splatting-based architecture featured with an innovative dynamic tone-mapping module that explicitly connects HDR and LDR domains, maintaining temporal radiance coherence by dynamically adapting tone-mapping functions according to the evolving radiance distributions across the temporal dimension. As a result, HDR-4DGS achieves both temporal radiance consistency and spatially accurate color translation, enabling photorealistic HDR renderings from arbitrary viewpoints and time instances. Extensive experiments demonstrate that HDR-4DGS surpasses existing state-of-the-art methods in both quantitative performance and visual fidelity. Source code is available at \url{https://github.com/prinasi/HDR-4DGS}.

Poster

P4-#3305

S2GO: Streaming Sparse Gaussian Occupancy

Jinhyung Park ⋅ Chensheng Peng ⋅ yihan hu ⋅ Wenzhao Zheng ⋅ Kris Kitani ⋅ Wei Zhan

Despite the efficiency and performance of sparse query-based representations for detection, state-of-the-art 3D occupancy estimation methods still rely on voxel-based or dense Gaussian-based 3D representations. However, dense representations are slow, and they lack flexibility in capturing the temporal dynamics of driving scenes. Distinct from prior work, we instead summarize the scene into a compact set of 3D queries which are propagated through time in an online, streaming fashion. These queries are then decoded into semantic Gaussians at each timestep. We couple our framework with a denoising rendering objective to guide the queries and their constituent Gaussians in effectively capturing scene geometry. Due to its efficient, query-based representation, S2GO achieves state-of-the-art performance on the nuScenes and KITTI occupancy benchmarks, outperforming prior art (e.g., GaussianWorld) by 2.7 IoU with 4.5x faster inference.

Poster

P4-#3304

pySpatial: Generating 3D Visual Programs for Zero-Shot Spatial Reasoning

Zhanpeng Luo ⋅ Ce Zhang ⋅ Silong Yong ⋅ Cunxi Dai ⋅ Qianwei Wang ⋅ Haoxi Ran ⋅ Guanya Shi ⋅ Katia Sycara ⋅ Yaqi Xie

Multi-modal Large Language Models (MLLMs) have demonstrated strong capabilities in general-purpose perception and reasoning, but they still struggle with tasks that require spatial understanding of the 3D world. To address this, we introduce pySpatial, a visual programming framework that equips MLLMs with the ability to interface with spatial tools via Python code generation. Given an image sequence and a natural-language query, the model composes function calls to spatial tools including 3D reconstruction, camera-pose recovery, novel-view rendering, etc. These operations convert raw 2D inputs into an explorable 3D scene, enabling MLLMs to reason explicitly over structured spatial representations. Notably, pySpatial requires no gradient-based fine-tuning and operates in a fully zero-shot setting. Experimental evaluations on the challenging MindCube and Omni3D-Bench benchmarks demonstrate that our framework pySpatial consistently surpasses strong MLLM baselines; for instance, it outperforms GPT-4.1-mini by 12.94% on MindCube. Furthermore, we conduct real-world indoor navigation experiments where the robot can successfully traverse complex environments using route plans generated by pySpatial, highlighting the practical effectiveness of our approach. Our project website will be available at https://pySpatial.github.io.

Poster

P4-#3303

MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos

Kejian Zhu ⋅ Zhuoran Jin ⋅ Hongbang Yuan ⋅ Jiachun Li ⋅ Shangqing Tu ⋅ Pengfei Cao ⋅ Yubo Chen ⋅ Kang Liu ⋅ Jun Zhao

The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match frames mentioned in the question (hereafter referred to as "question frame'') and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: Models are required to infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: Questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: All tasks are manually annotated, referencing extensive real-world user understanding to align with common perceptions. (4) Confusability: Carefully designed distractor annotation strategies to reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multi-modal reasoning; even the best-performing model, Gemini-2.5-pro, achieves only 64.3% accuracy. Additionally, current reasoning enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Error analysis indicates that the CoT demanded for multi-modal reasoning differs from it in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multi-modal reasoning capabilities.

Poster

P4-#3302

ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

Congzhi Zhang ⋅ Zhibin Wang ⋅ Yinchao Ma ⋅ Jiawei Peng ⋅ Yihan Wang ⋅ Qiang Zhou ⋅ Jun Song ⋅ Bo Zheng

While Reinforcement Learning with Verifiable Reward (RLVR) significantly advances image reasoning in Large Vision-Language Models (LVLMs), its application to complex video reasoning remains underdeveloped. This gap stems primarily from a critical data bottleneck: existing datasets lack the challenging, multi-hop questions and high-quality, video-grounded Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address this, we introduce ReWatch, a large-scale dataset built to foster advanced video reasoning. We propose a novel multi-stage synthesis pipeline to synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT. A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which simulates a human-like "re-watching" process to generate video-grounded reasoning traces by explicitly modeling information retrieval and verification. Building on this dataset, we develop ReWatch-R1 by post-training a strong baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This framework incorporates a novel Observation & Reasoning (O&R) reward mechanism that evaluates both the final answer's correctness and the reasoning's alignment with video content, directly penalizing hallucination. Our experiments show that ReWatch-R1 achieves state-of-the-art average performance on five challenging video reasoning benchmarks.

Poster

P4-#3301

DTP: Delta-Guided Two Stage Pruning for Mamba-based Multimodal Large Language Models

Seong Yeol Park ⋅ Kwon Min Jung ⋅ XIANGHUA PIAO ⋅ Yeong Hyeon Gu

Multimodal large language models built on the Mamba architecture offer efficiency advantages, yet remain hampered by redundant visual tokens that inflate inference cost, with the prefill stage accounting for the majority of total inference time. We introduce Delta-guided Two stage Pruning (DTP), a method that progressively reduces token redundancy through selective pruning at early layer and complete pruning at late layer. Unlike Transformer-oriented pruning methods, our approach derives token importance directly from Mamba’s internal parameters. The statistical distribution of these importance scores, combined with implicit attention patterns, then provides the basis for determining both the pruning layers and the tokens to be removed. Extensive evaluation across diverse benchmarks shows that DTP cuts computation by nearly 50\%, maintains higher task performance than existing pruning methods, and further achieves over a 35\% reduction in prefill latency. Beyond efficiency, our analysis reveals previously underexplored behaviors of visual tokens within Mamba layers, suggesting a principled perspective for designing future pruning techniques in Mamba-based Multimodal Large Language Models.

Poster

P4-#3401

UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling

Zhihao Sun ⋅ Tong Wu ⋅ Ruirui Tu ⋅ Daoguo Dong ⋅ Zuxuan Wu

Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.

Poster

P4-#3402

FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding

Janghoon Cho ⋅ Jungsoo Lee ⋅ Munawar Hayat ⋅ Kyuwoong Hwang ⋅ Fatih Porikli ⋅ Sungha Choi

Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from extended video sequences. To address this challenge, we propose FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that swiftly selects a compact yet highly representative and diverse subset of visual tokens within a predefined budget on the number of visual tokens. By integrating the lazy greedy algorithm, our method achieves remarkable efficiency gains by swiftly selecting a compact subset of tokens, drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that seamlessly integrates with diverse video-LLMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, LongVideoBench, and EgoSchema, show that our framework consistently surpasses recent compression techniques, highlighting its effectiveness and robustness in addressing the challenges of long video understanding as well as its processing efficiency.

Poster

P4-#3403

iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

Lianyu Hu ⋅ Liqing Gao ⋅ Fanhua Shang ⋅ Liang Wan ⋅ Wei Feng

Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or within the Large Language Model (LLM) stage to lower computational cost. This overlooks other major bottlenecks, particularly the image encoder, which itself requires substantial computation. As a result, these methods fall short of achieving true end-to-end acceleration. Importantly, the image encoder is the primary contributor of input tokens to the LLM. Thus, reducing visual redundancy at the encoder stage not only speeds up the encoder itself but also significantly lightens the workload for the subsequent LLM. Motivated by this, we investigate how to jointly optimize the image encoder and the LLM along with other LVLM components for comprehensive acceleration. To mitigate the risk of performance degradation from token reduction, we propose a novel token merging strategy that recycles useful information from otherwise discarded tokens. Our approach, iLLaVA, delivers consistent improvements across both image and video understanding tasks, achieving up to a 2$\times$ throughput boost and a 4$\times$ reduction in prefilling time. Notably, iLLaVA enables a larger model (e.g., InternVL-2.5 26B) to surpass a smaller counterpart (e.g., InternVL-2.5 8B) in both accuracy and efficiency. Extensive comparisons with state-of-the-art token pruning and merging techniques demonstrate the clear superiority of our method. Finally, we provide detailed visualizations for the merging steps of iLLaVA , offering deeper insights into how different LVLM components contribute to efficient computation.

Poster

P4-#3405

SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

Yuyou Zhang ⋅ Radu Corcodel ⋅ Chiori Hori ⋅ Anoop Cherian ⋅ DING ZHAO

We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, relative positions grounding, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that single-object simpler tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 43 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2\%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. Together, our findings highlight the need for structured, cognitively inspired diagnostic tools to advance spatial reasoning in multimodal foundation models. Our website can be found here.

Poster

P4-#3406

DriveAgent-R1: Advancing VLM-based Autonomous Driving with Active Perception and Hybrid Thinking

Weicheng Zheng ⋅ Xiaofei Mao ⋅ Nanfei Ye ⋅ Pengxiang Li ⋅ Kun Zhan ⋅ XianPeng Lang ⋅ Hang Zhao

The advent of Vision-Language Models (VLMs) has significantly advanced end-to-end autonomous driving, demonstrating powerful reasoning abilities for high-level behavior planning tasks. However, existing methods are often constrained by a passive perception paradigm, relying solely on text-based reasoning. This passivity restricts the model’s capacity to actively seek crucial visual evidence when faced with uncertainty. To address this, we introduce DriveAgent-R1, an autonomous driving agent capable of active perception for planning. In complex scenarios, DriveAgent-R1 proactively invokes tools to perform visual reasoning, firmly grounding its decisions in visual evidence, thereby enhancing both interpretability and reliability. Furthermore, we propose a hybrid thinking framework, inspired by human driver cognitive patterns, allowing the agent to adaptively switch between efficient text-only reasoning and robust tool-augmented visual reasoning based on scene complexity. This capability is cultivated through a three-stage progressive training strategy, featuring a core Cascaded Reinforcement Learning (Cascaded RL) phase. Extensive experiments on the Drive-Internal dataset, which is rich in long-tail scenarios, and the public nuScenes dataset show that, with only 3B parameters, DriveAgent-R1 achieves competitive performance comparable to top closed model systems such as GPT-5 and to human driving proficiency while remaining deployment-friendly, offering a proven path toward building more intelligent autonomous driving systems.

Poster

P4-#3407

lmgame-Bench: How Good are LLMs at Playing Games?

Lanxiang Hu ⋅ Mingjia Huo ⋅ Yuxuan Zhang ⋅ Haoyang Yu ⋅ Eric P Xing ⋅ Ion Stoica ⋅ Tajana Rosing ⋅ Haojian Jin ⋅ Hao Zhang

Playing video games requires perception, reasoning, memory, and long-horizon planning—exactly the faculties expected of modern large language and vision–language models (LLMs/VLMs). We introduce LMGame-Bench, a benchmark built on six popular games spanning platformer, puzzle, and narrative games through a unified Gym‑style API. Unlike prior game benchmarks that entangle multiple skills, LMGame-Bench employs a modular harness—including perception, memory, and reasoning modules—that can be toggled to selectively probe distinct capabilities. The benchmark further improves robustness through prompt standardization and contamination mitigation. Evaluation of 13 state-of-the-art models demonstrates that LMGame-Bench remains challenging yet effectively discriminates among models. Correlation analysis reveals that individual games align with core LLM capabilities, providing a quantitative framework for interpreting performance. Finally, LMGame-Bench exposes models’ limitations in visual state extraction, reflection, spatiotemporal reasoning, and long-context reasoning, pointing to concrete directions for model improvement.

Poster

P4-#3408

PlantRSR: A New Plant Dataset and Method for Reference-based Super-Resolution

Hongyang Zhou ⋅ Xiaobin Zhu ⋅ Shengxiang Yu ⋅ Liuling Chen ⋅ Jingyan Qin ⋅ Xu-Cheng Yin

Single image super-resolution (SISR) often struggles to reconstruct high-resolution (HR) details from heavily degraded low-resolution (LR) inputs. Instead, reference-based super-resolution (RefSR) methods offer an alternative solution to generate promising results using high-quality reference (Ref) images to guide reconstruction. However, existing RefSR datasets focus on limited scene types, primarily featuring human activities and architectural scenes. Plant scenes exhibit complex textures and fine details, essential for advancing RefSR in natural and highly detailed scenes. To this end, we meticulously captured and manually selected high-quality images containing rich textures to construct a large-scale plant dataset, PlantRSR, comprising 16,585 HR–Ref pairs. The dataset captures the complexity and variability of plant scenes through extensive variations. In addition, we propose a novel RefSR method specifically designed to tackle the distinct challenges posed by plant imagery. It incorporates a Selective Key-Region Matching (SKRM) that selectively identifies and performs matching between LR and Ref images, focusing on distinctive botanical textures to improve matching efficiency. Additionally, a Texture-Guided Diffusion Module (TGDM) is proposed to refine LR textures by leveraging a diffusion process conditioned on the matched Ref textures. TGDM is effective in modeling irregular and fine textures, thereby facilitating more accurate SR results. The proposed method achieves significant improvements over state-of-the-art (SOTA) approaches on our PlantRSR dataset and other Benchmarks.

Poster

P4-#3409

OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

Longrong Yang ⋅ Zhixiong Zeng ⋅ Yufeng Zhong ⋅ Jing Huang ⋅ Liming Zheng ⋅ Lei Chen ⋅ Haibo Qiu ⋅ Zequn Qin ⋅ Lin Ma ⋅ Xi Li

Multimodal large language models are progressively advancing toward multimodal agents that can proactively execute tasks. Existing research on multimodal agents primarily targets either GUI or embodied scenarios, corresponding to interactions within 2D virtual world and 3D physical world, respectively. However, many real-world tasks inherently require agents to interleave interactions across both types of environments. We initially mix GUI and embodied data to train models, but find performance degradation caused by data conflicts. Further analysis reveals that GUI and embodied data exhibit synergy at shallow layers but conflict at deep layers, resembling the cerebrum-cerebellum mechanism in the human brain. To this end, we introduce a high-performance generalist agent, OmniActor, designed from both structural and data perspectives. First, we propose Layer-heterogeneous MoE that separates parameters at deep layers to eliminate conflict, while sharing parameters at shallow layers to leverage synergy. This design enables OmniActor to outperform agents trained solely on GUI or embodied data in their respective tasks. Furthermore, we unify the action spaces of GUI and embodied tasks and collect large-scale datasets from diverse sources for training. This substantially enhances the performance of OmniActor across various scenarios, especially in GUI tasks.

Poster

P4-#3410

VGR: Visual Grounded Reasoning

Jiacong Wang ⋅ Zijian Kang ⋅ Haochen Wang ⋅ Xiao Liang ⋅ Ya Wang ⋅ Jiawen Li ⋅ Bohong Wu ⋅ Jiao Ran ⋅ Haiyong Jiang ⋅ Chao Feng ⋅ Jun Xiao

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning on pure linguistic space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) that can replay the visual memory during thinking just like humans. Unlike traditional MLLMs, VGR first thinks the question and detects relevant regions that may help solve problems, then, the visual memory from the critical area is extracted to assist reasoning. To achieve this, we curate a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. This teaches VGR to think and actively choose grounding areas for key information before answering, and we propose a dynamic visual memory replay stage to integrates the corresponding information into the reasoning process, enhancing multimodel comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering scores of +4.1 on MMStar, +7.1 on AI2D, and +12.9 improvement on ChartQA. The data is available at https://huggingface.co/BytedanceDouyinContent/VGR.

Poster

P4-#3411

Perception-Aware Policy Optimization for Multimodal Reasoning

Zhenhailong Wang ⋅ Xuehang Guo ⋅ Sofia Stoica ⋅ Haiyang Xu ⋅ Hongru WANG ⋅ Hyeonjeong Ha ⋅ Xiusi Chen ⋅ Yangyi Chen ⋅ Ming Yan ⋅ Fei Huang ⋅ Heng Ji

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for empowering Large Language Models (LLMs) with long chain-of-thought reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error (67%) in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to generate visually grounded reasoning without external supervision. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which maximizes the difference between two probability distributions over the same rollout sequence, conditioned on either the original or corrupted visual input. Notably, PAPO does not rely on any additional data annotation, reward models, or stronger teacher models, and can therefore be seamlessly integrated into mainstream RLVR algorithms such as GRPO and DAPO. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, PAPO offers a new perspective on advancing multimodal RLVR via the optimization objective, moving beyond rollout or reward design and pointing toward deeper integration of perception and reasoning.

Poster

P4-#3412

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

Qihua Dong ⋅ Kuo Yang ⋅ Lin Ju ⋅ Handong Zhao ⋅ Yitian Zhang ⋅ Yizhou Wang ⋅ Huimin Zeng ⋅ Jianglin Lu ⋅ Yun Fu

Referring Expression Comprehension (REC) links language to region level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reason- ing demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains various expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word order perturbations and descriptor deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs. The dataset is available at \url{https://ref-adv.github.io/}.

Poster

P4-#3414

Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation

Hongtao Yu ⋅ Yuxin Peng ⋅ Serge Belongie ⋅ Xiu-Shen Wei

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable multimodal perception capabilities, garnering significant attention. While numerous evaluation studies have emerged, assessing LVLMs both holistically and on specialized tasks, fine-grained image tasks—fundamental to computer vision—remain largely unexplored. To fill this gap, we introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.28 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives, focusing on their semantic recognition and fine-grained feature representation capabilities. Through extensive experiments on twelve representative LVLMs/VLMs, we uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance. This work provides critical insights into the limitations of current LVLMs and offers guidance for future data construction and model design in the development of more advanced LVLMs. Our code is open-source and available at https://github.com/SEU-VIPGroup/FG-BMK.

Poster

P4-#3415

Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Yizhou WANG ⋅ Song Mao ⋅ Yang Chen ⋅ Yufan Shen ⋅ Pinlong Cai ⋅ Ding Wang ⋅ Guohang Yan ⋅ Zhi Yu ⋅ Yinqiao Yan ⋅ Xuming Hu ⋅ Botian Shi

Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully—and sometimes even improves—when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder’s marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model. Using these tools, we observe: (i) strong specialization on tasks like OCR & Chart, where a single encoder can dominate with a CUR >90%, (ii) high redundancy on general VQA and knowledge-based tasks, where encoders are largely interchangeable, (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to 16% higher accuracy on a specific task category and 3.6% overall performance boost compared to the full model. Furthermore, single- and dual- encoder variants recover over 90% of baseline on most non-OCR tasks. Our analysis challenges the “more encoders are better” heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.

Poster

P4-#3416

SCUBA: Salesforce Computer Use Benchmark

Yutong Dai ⋅ Krithika Ramakrishnan ⋅ Jing Gu ⋅ Matthew Fernandez ⋅ Yanqi Luo ⋅ Viraj Prabhu ⋅ Zhenyu Hu ⋅ silvio savarese ⋅ Caiming Xiong ⋅ Zeyuan Chen ⋅ Ran Xu

We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas—platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including Enterprise Software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments with support for parallel execution and fine-grained evaluation metrics to capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings. We observed huge performance gaps in different agent design paradigm and gaps between the open-source model and the closed-source model. In the zero-shot setting, open-source model powered computer-use agents that have strong performance on related benchmarks like OSWorld only have less than 5\% success rate on SCUBA, while methods built on closed-source models can still have up to 39\% percent task success rate. In the demonstration-augmented settings, task success rates can be improved to 50\% while simultaneously reducing time and costs by 13\% and 16\%, respectively. These findings highlight both the challenges of enterprise tasks automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems.

Poster

P4-#3417

ExPO-HM: Learning to Explain-then-Detect for Hateful Meme Detection

Jingbiao Mei ⋅ Mingsheng Sun ⋅ Jinghong Chen ⋅ Pengda Qin ⋅ Yuhong Li ⋅ Da Chen ⋅ Bill Byrne

Hateful memes have emerged as a particularly challenging form of online abuse, motivating the development of automated detection systems. Most prior approaches rely on direct detection, producing only binary predictions. Such models fail to provide the context and explanations that real-world moderation requires. Recent Explain-then-Detect approaches, using Chain-of-Thought prompting or LMM agents, perform worse than simple SFT baselines, and even advanced post-training methods such as GRPO fail to close the gap. Our analysis identifies two key issues of such systems: important policy-relevant cues such as targets and attack types are not hypothesized by the model as a likely explanation; and the binary reward signal is insufficient to guide reasoning. To address these challenges, we propose ExPO-HM (Explain-then-Detect Policy Optimization for Hateful Memes), inspired by the training and evaluation process of human annotators. ExPO-HM combines SFT warmup, GRPO with curriculum learning, and Conditional Decision Entropy (CDE) as both metric and reward for reasoning quality. Across three hateful meme benchmarks, ExPO-HM achieves state-of-the-art performance on binary detection, fine-grained classification, and reasoning quality, with up to 15\% and 17\% F1 improvement over the GRPO and DPO baselines, respectively. By moving hateful meme detection from simple binary alarms to explanation-driven detection, ExPO-HM provides accurate, interpretable, and actionable moderation support. Code available at: https://github.com/JingbiaoMei/ExPO-HM

Poster

P4-#3418

Uncover Underlying Correspondence for Robust Multi-view Clustering

Haochen Zhou ⋅ Guofeng Ding ⋅ Mouxing Yang ⋅ Peng Hu ⋅ Yijie Lin ⋅ Xi Peng

Multi-view clustering (MVC) aims to group unlabeled data into semantically meaningful clusters by leveraging cross-view consistency. However, real-world datasets collected from the web often suffer from noisy correspondence (NC), which breaks the consistency prior and results in unreliable alignments. In this paper, we identify two critical forms of NC that particularly harm clustering: i) category-level mismatch, where semantically consistent samples from the same class are mistakenly treated as negatives; and ii) sample-level mismatch, where collected cross-view pairs are misaligned and some samples may even lack any valid counterpart. To address these challenges, we propose \textbf{CorreGen}, a generative framework that formulates noisy correspondence learning in MVC as maximum likelihood estimation over underlying cross-view correspondences. The objective is elegantly solved via an Expectation–Maximization algorithm: in the E-step, soft correspondence distributions are inferred across views, capturing class-level relations while adaptively down-weighting noisy or unalignable samples through GMM-guided marginals; in the M-step, the embedding network is updated to maximize the expected log-likelihood. Extensive experiments on both synthetic and real-world noisy datasets demonstrate that our method significantly improves clustering robustness. The code is available at https://github.com/XLearning-SCU/2026-ICLR-CorreGen.

Poster

P4-#3518

Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models

Jianghao Yin ⋅ Qin Chen ⋅ Kedi Chen ⋅ Jie Zhou ⋅ Xingjiao Wu ⋅ Liang He

Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems. Through in-depth analysis of LVLM activation patterns, we reveal two key findings: 1) truthfulness and visual perception capabilities predominantly engage different subsets of attention heads within the model architecture; and 2) truthfulness steering vectors vary significantly across different semantic contexts. Based on these observations, we propose Dynamic Multimodal Activation Steering, a training-free approach for hallucination mitigation. Our method constructs a semantic-based truthfulness steering vector database and computes visual perception steering vectors, enabling context-aware interventions during inference by dynamically selecting the most relevant steering vectors based on input semantic similarity and applying them to the most influential attention heads. We conduct comprehensive experiments across multiple models and datasets, demonstrating that our approach significantly enhances model performance, outperforming existing state-of-the-art methods.

Poster

P4-#3517

AnyUp: Universal Feature Upsampling

Thomas Wimmer ⋅ Prune Truong ⋅ Marie-Julie Rakotosaona ⋅ Michael Oechsle ⋅ Federico Tombari ⋅ Bernt Schiele ⋅ Jan Eric Lenssen

We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.

Poster

P4-#3516

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Xinyu Tian ⋅ Shu Zou ⋅ Zhaoyuan Yang ⋅ Mengqi He ⋅ Fabian Waschkowski ⋅ Lukas Wesemann ⋅ Peter Tu ⋅ Jing Zhang

Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, longer reasoning length may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning ength causes models to disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our result model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on various benchmarks.

Poster

P4-#3515

UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding

Yueming Xu ⋅ Jiahui Zhang ⋅ Ze Huang ⋅ Yurui Chen ⋅ Yanpeng Zhou ⋅ Dave Chen ⋅ Yu-Jie Yuan ⋅ Pengxiang Xia ⋅ Guowei Huang ⋅ Xinyue Cai ⋅ Zhongang Qi ⋅ Xingyue Quan ⋅ Jianye Hao ⋅ Hang Xu ⋅ Li Zhang

Despite the impressive progress on understanding and generating images shown by the recent unified architectures, the integration of 3D tasks remains challenging and largely unexplored. In this paper, we introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our unified framework employs an LLM to comprehend and decode sentences and 3D representations. At its core, we propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations. This allows for the generation and imagination of 3D scenes based on a reference image and an arbitrary view transformation, while remaining supports for spatial visual question answering (VQA) tasks. Additionally, we propose a geometric-semantic learning strategy to pretrain the vision encoder. This design jointly captures the input's semantic and geometric cues, enhancing both spatial understanding and generation. Extensive experimental results demonstrate the superiority of our method in visual representation, spatial understanding, and 3D generation.

Poster

P4-#3514

WIMFRIS: WIndow Mamba Fusion and Parameter Efficient Tuning for Referring Image Segmentation

Seunghun Moon ⋅ Hyunwoo Yu ⋅ Haeuk Lee ⋅ Suk-Ju Kang

Existing Parameter-Efficient Tuning (PET) methods for Referring Image Segmentation (RIS) primarily focus on layer-wise feature alignment, often neglecting the crucial role of a neck module for the intermediate fusion of aggregated multi-scale features, which creates a significant performance bottleneck. To address this limitation, we introduce WIMFRIS, a novel framework that establishes a powerful neck architecture alongside a simple yet effective PET strategy. At its core is our proposed HMF block, which first aggregates multi-scale features and then employs a novel WMF module to perform effective intermediate fusion. This WMF module leverages non-overlapping window partitioning to mitigate the information decay problem inherent in SSMs while ensuring rich local-global context interaction. Furthermore, our PET strategy enhances primary alignment with a MTA for robust textual priors, a MSA for precise vision-language fusion, and learnable emphasis parameters for adaptive stage-wise feature weighting. Extensive experiments demonstrate that WIMFRIS achieves new state-of-the-art performance across all public RIS benchmarks.

Poster

P4-#3513

InfoDet: A Dataset for Infographic Element Detection

Jiangning Zhu ⋅ Yuxing Zhou ⋅ Zheng Wang ⋅ Juntao Yao ⋅ Yima Gu ⋅ Yuhui Yuan ⋅ Shixia Liu

Given the central role of charts in scientific, business, and communication contexts, enhancing the chart understanding capabilities of vision-language models (VLMs) has become increasingly critical. A key limitation of existing VLMs lies in their inaccurate visual grounding of infographic elements, including charts and human-recognizable objects (HROs) such as icons and images. However, chart understanding often requires identifying relevant elements and reasoning over them. To address this limitation, we introduce InfoDet, a dataset designed to support the development of accurate object detection models for charts and HROs in infographics. It contains 11,264 real and 90,000 synthetic infographics, with over 14 million bounding box annotations. These annotations are created by combining the model-in-the-loop and programmatic methods. We demonstrate the usefulness of InfoDet through three applications: 1) constructing a Thinking-with-Boxes scheme to boost the chart understanding performance of VLMs, 2) comparing existing object detection models, and 3) applying the developed detection model to document layout and UI element detection.

Poster

P4-#3512

CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild

Balamurugan Thambiraja ⋅ Omid Taheri ⋅ Radek Danecek ⋅ Giorgio Becherini ⋅ Gerard Pons-Moll ⋅ Justus Thies

Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to “in-the-wild” settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text–motion alignment. To address this, we (1) introduce ‘3D Hands in the Wild’ (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D- HIW, we propose a data annotation pipeline that combines vision–language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part–modality decomposed VQ- VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to- motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.

Poster

P4-#3511

MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents

Xijia Tao ⋅ Teng Yihua ⋅ Xinxing Su ⋅ Xinyu Fu ⋅ Jihao Wu ⋅ Chaofan Tao ⋅ Ziru Liu ⋅ Haoli Bai ⋅ Rui Liu ⋅ Lingpeng Kong

Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification. We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine-grained visual cues through iterative image–text retrieval and cross-validation under retrieval noise. Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out-of-image facts such as events, dates, and venues. Beyond the dataset, we provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning. We evaluated closed- and open-source MLLMs in this framework. The strongest system achieves an end-to-end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements up to +3.9 points. From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic MLLMs.

Poster

P4-#3510

PoSh: Using Scene Graphs to Guide LLMs-as-a-Judge for Detailed Image Descriptions

Amith Ananthram ⋅ Elias Stengel-Eskin ⋅ Lorena Bradford ⋅ Julia Demarest ⋅ Adam Purvis ⋅ Keith Krut ⋅ Robert Stein ⋅ Rina Pantalony ⋅ Mohit Bansal ⋅ Kathleen McKeown

While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman ρ) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.

Poster

P4-#3509

MRAD: Zero-Shot Anomaly Detection with Memory-Driven Retrieval

Chaoran Xu ⋅ Chengkan Lv ⋅ Qiyu Chen ⋅ Feng Zhang ⋅ Zhengtao Zhang

Zero-shot anomaly detection (ZSAD) often leverages pretrained vision or vision-language models, but many existing methods use prompt learning or complex modeling to fit the data distribution, resulting in high training or inference cost and limited cross-domain stability. To address these limitations, we propose Memory-Retrieval Anomaly Detection method (MRAD), a unified framework that replaces parametric fitting with a direct memory retrieval. The train-free base model, MRAD-TF, freezes the CLIP image encoder and constructs a two-level memory bank (image-level and pixel-level) from auxiliary data, where feature-label pairs are explicitly stored as keys and values. During inference, anomaly scores are obtained directly by similarity retrieval over the memory bank. Based on the MRAD-TF, we further propose two lightweight variants as enhancements: (i) MRAD-FT fine-tunes the retrieval metric with two linear layers to enhance the discriminability between normal and anomaly; (ii) MRAD-CLIP injects the normal and anomalous region priors from the MRAD-FT as dynamic biases into CLIP's learnable text prompts, strengthening generalization to unseen categories. Across 16 industrial and medical datasets, the MRAD framework consistently demonstrates superior performance in anomaly classification and segmentation, under both train-free and training-based settings. Our work shows that fully leveraging the empirical distribution of raw data, rather than relying only on model fitting, can achieve stronger anomaly detection performance. The code has been publicly released at https://github.com/CROVO1026/MRAD.

Poster

P4-#3508

Faster Vision Transformers with Adaptive Patches

Rohan Choudhury ⋅ JungEun Kim ⋅ Jinhyung Park ⋅ Eunho Yang ⋅ Laszlo A. Jeni ⋅ Kris Kitani

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40\% on ViT-L and 50\% on ViT-H while maintaining downstream performance. It can be applied to a previously fine-tuned ViT and converges in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30\% faster training and inference in visual QA, object detection, and semantic segmentation. We will release all code and trained models.

Poster

P4-#3507

Constructive Distortion: Improving MLLMs with Attention-Guided Image Warping

Dwip Dalal ⋅ Gautam Vashishtha ⋅ Utkarsh Mishra ⋅ Jeonghwan Kim ⋅ Madhav Kanda ⋅ Hyeonjeong Ha ⋅ Svetlana Lazebnik ⋅ Heng Ji ⋅ Unnat Jain

Multimodal large language models (MLLMs) often miss small details and spatial relations in cluttered scenes, leading to errors in fine-grained perceptual grounding. We introduce AttWarp, a lightweight method that allocates more resolution to query-relevant content while compressing less informative areas, all while pre- serving global context. At test time, AttWarp closes a simple self-correction loop: the MLLM first produces cross-modal attention on the original image, which we use to rectilinearly warp the input and re-run the same frozen model, reallocating resolution toward regions it deems important without changing weights or architecture. This attention-guided warping preserves all original image information but redistributes it non-uniformly, so small objects and subtle relationships become easier for the same model to read while the global layout remains intact. Across nine benchmarks (TextVQA, GQA, DocVQA, POPE, MMMU, MIA- Bench, MMVP, RealWorldQA, BLINK) and four MLLMs (LLaVA, Qwen-VL, InternVL, and InstructBLIP), AttWarp consistently improves accuracy, strengthens compositional reasoning, and reduces hallucinations, outperforming four competitive baselines that manipulate raw images at test time. Together, these results show that attention-guided warping prioritizes information relevant to the query while preserving context, and that the same MLLMs perform better when given such warped inputs. The code and demos are available on the project page: https://dwipddalal.github.io/Attwarp/

Poster

P4-#3506

Flatness Guided Test-Time Adaptation for Vision-Language Models

aodi Li ⋅ Liansheng Zhuang ⋅ Xiao Long ⋅ Houqiang Li ⋅ Shafei Wang

Test-time adaptation (TTA) of Vision-Language Models (VLMs) has emerged as a technique for tackling distribution shifts during the test time. Recent research indicates that the test-time adaptation is intrinsically linked to the model's training history. However, existing TTA methods, such as Test-time Prompt Tuning, often design adaptation strategies in isolation from the models' training characteristics, which degrade their performance. This paper argues that the flatness acquired via sharpness-aware training is an efficient clue for the test-time adaptation of VLMs. Built on this insight, this paper proposes a novel Flatness-Guided Adaptation framework (FGA) for VLMs to cohesively unify training and test-time procedures. Its core idea is to leverage the alignment between the training minimum and test loss flat regions to guide the adaptation process. Specifically, our FGA consists of a prompt-tuning stage and a test-time adaptation stage. In the tuning stage, a Sharpness-Aware Prompt Tuning method is utilized to identify the training flat minimum, offering a geometric clue of flatness for subsequent adaptation. In the test stage, a Sharpness-based Test Sample Selection approach is proposed to ensure the alignment of flat minima between the training and each augmented test sample's loss landscape. In comparison to existing TTA methods, our FGA avoids the expensive prompt parameter updates during test time, and substantially reduces the computation overhead. Extensive experiments on both domain generalization and cross-dataset benchmarks demonstrate that our FGA achieves superior performance over prevalent TTA methods. Notably, when employing a ViT-B/16 image encoder, FGA even outperforms TPT+CoOp by an average of 4.88\% across all four ImageNet out-of-domain variants.

Poster

P4-#3505

MARS - A Foundational Map Auto-Regressor

Qi Zhang ⋅ Suvam Bag ⋅ Rupanjali Kukal ⋅ Mikael Figueroa ⋅ Rishi Madhok ⋅ Nikolaos Karianakis ⋅ Fuxun Yu

Map generation tasks feature extensive non-structural vectorized data (e.g., points, polylines, and polygons) and thus pose significant challenges to common pixel-wise generative models. Conventional approaches use multiple stages, first segmenting these features at the pixel level and then performing vectorized post-processing, with errors and complexity compounding at each stage. Motivated by the recent success of auto-regressive language modeling, we propose the first map foundation model, named Map Auto-Regressor (MARS), that is capable of generating both multi-polyline road networks and polygon buildings in a unified manner. For training MARS, we collected to our knowledge the largest multi-class map extraction dataset totaling 3.4M examples, which we call MAP-3M. Across four road and building datasets, MARS outperforms or matches the performance of multistage baselines. Additionally, we develop a ``Chat with MARS'' capability that enables interactive human-in-the-loop map generation and correction, supported by the auto-regressive nature of our end-to-end approach. We release our MAP-3M dataset and project demo page at (1) https://huggingface.co/datasets/bag-lab/MAP-3M and (2) https://huggingface.co/spaces/bag-lab/MARS, respectively.

Poster

P4-#3504

FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

Jiyoon Pyo ⋅ Yuankun Jiao ⋅ Dongwon Jung ⋅ Zekun Li ⋅ Leeje Jang ⋅ Sofia Kirsanova ⋅ Jina Kim ⋅ Yijun Lin ⋅ Qin Liu ⋅ Junyi Xie ⋅ Hadi Askari ⋅ Nan Xu ⋅ Muhao Chen ⋅ Yao-Yi Chiang

Cartographic reasoning is the skill of interpreting geographic relationships by aligning legends, map scales, compass directions, map texts, and geometries across one or more map images. Although essential as a concrete cognitive capability and for critical tasks such as disaster response and urban planning, it remains largely unevaluated. Building on progress in chart and infographic understanding, recent large vision language model (LVLM) works on map visual question-answering (VQA) often simplify maps as a special case of charts. In contrast, map VQA demands comprehension of layered symbology (e.g., symbols, geometries, and text labels) as well as spatial relations tied to orientation and distance that often span multiple maps and are not captured by chart-style evaluations. To address this gap, we introduce FRIEDA, a benchmark for testing complex open-ended cartographic reasoning in LVLMs. FRIEDA sources real map images from documents and reports in various domains (e.g., geology, urban planning, and environmental assessment) and geographical areas. Following classifications in Geographic Information System (GIS) literature, FRIEDA targets all three categories of spatial relations: topological (border, equal, intersect, within), metric (distance), and directional (orientation). All questions require multi-step inference, and many require cross-map grounding and reasoning. We evaluate eleven state-of-the-art LVLMs under two settings: (1) the direct setting, where we provide the maps relevant to the question, and (2) the contextual setting, where the model may have to identify the maps relevant to the question before reasoning. Even the strongest models, Gemini-2.5-Pro and GPT-5-Think, achieve only 38.20\% and 37.20\% accuracy, respectively, far below human performance of 84.87\%. These results reveal a persistent gap in multi-step cartographic reasoning, positioning FRIEDA as a rigorous benchmark to drive progress on spatial intelligence in LVLMs.

Poster

P4-#3503

ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Qineng Wang ⋅ Wenlong Huang ⋅ Yu Zhou ⋅ Hang Yin ⋅ Tianwei Bao ⋅ Jianwen Lyu ⋅ Weiyu Liu ⋅ Ruohan Zhang ⋅ Jiajun Wu ⋅ Li Fei-Fei ⋅ Manling Li

Embodied cognition argues that intelligence arises from continuous sensorimotor interaction with the world. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? To investigate this, we introduce ENACT, a benchmark that probes this question through world modeling from egocentric interaction. Grounded in a partially observable Markov decision process (POMDP) framework, ENACT comprises two complementary sequence reordering tasks: forward world modeling (predicting an ordered sequence of future states from actions) and inverse world modeling (inferring an ordered sequence of actions from state changes). Correctly solving these tasks indicates that the model has a solid understanding of how the environment will evolve given one's actions. Our scalable dataset contains 8,972 QA pairs derived from diverse, long-horizon household activities in the BEHAVIOR simulator. Experiments reveal a significant performance gap between state-of-the-art VLMs and humans, which widens dramatically as interaction horizons lengthen. We find that models consistently solve the inverse problem better than the forward one and exhibit strong embodied biases, showing a preference for right-handed actions and performance degradation with camera perspectives that deviate from those of human vision.

Poster

P4-#3502

InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models

Nianchen Deng ⋅ Lixin Gu ⋅ Shenglong Ye ⋅ Yinan He ⋅ Zhe Chen ⋅ Songze Li ⋅ Haomin Wang ⋅ Jinhui Yin ⋅ Qi Wei ⋅ Tianshuo Yang ⋅ Min Dou ⋅ Tong He ⋅ Wenqi Shao ⋅ Kaipeng Zhang ⋅ Yi Wang ⋅ Botian Shi ⋅ Yanting Zhang ⋅ Jifeng Dai ⋅ Yu Qiao ⋅ Wenhai Wang ⋅ Hongjie Zhang

Recent benchmarks and datasets have been proposed to improve spatial reasoning in vision-language models (VLMs), yet existing open resources remain constrained by limited scale, narrow visual diversity, and restricted instruction expressiveness. To address these gaps, we present InternSpatial---the largest open-source dataset for spatial reasoning in VLMs---alongside InternSpatial-Bench, a comprehensive evaluation benchmark designed to assess spatial understanding across diverse instruction formats. InternSpatial contains 12 million question-answer(QA) pairs covering both single-view and multi-view scenarios, sourced from varied visual environments and supporting 19 distinct instruction formats that mirror real-world query patterns. InternSpatial-Bench aims to single-view assessment and also extends multi-view reasoning through a novel rotation estimation task. Experimental validation demonstrates that models trained on \trainset achieve substantial performance improvement of 12.1% on InternSpatial-Bench and 10.7% on VSI-Bench, while preserving competitive performance on general-purpose benchmarks. We expect these resources can advance the development of spatially-capable VLMs for practical applications in robotics and embodied AI systems. Our codes and datasets are publicly available at https://github.com/dengnianchen/intern-spatial.

Poster

P4-#3501

RegionReasoner: Region-Grounded Multi-Round Visual Reasoning

Wenfang Sun ⋅ Hao Chen ⋅ Yingjun Du ⋅ Yefeng Zheng ⋅ Cees G Snoek

Large vision-language models have achieved remarkable progress in visual reasoning, yet most existing systems rely on single-step or text-only reasoning, limiting their ability to iteratively refine understanding across multiple visual contexts. To address this limitation, we introduce a new multi-round visual reasoning benchmark with training and test sets spanning both detection and segmentation tasks, enabling systematic evaluation under iterative reasoning scenarios. We further propose RegionReasoner, a reinforcement learning framework that enforces grounded reasoning by requiring each reasoning trace to explicitly cite the corresponding reference bounding boxes, while maintaining semantic coherence via a global–local consistency reward. This reward extracts key objects and nouns from both global scene captions and region-level captions, aligning them with the reasoning trace to ensure consistency across reasoning steps. RegionReasoner is optimized with structured rewards combining grounding fidelity and global–local semantic alignment. Experiments on detection and segmentation tasks show that RegionReasoner-7B, together with our newly introduced benchmark RegionDial-Bench, considerably improves multi-round reasoning accuracy, spatial grounding precision, and global–local consistency, establishing a strong baseline for this emerging research direction.

Poster

P4-#3601

Knowledge Exchange with Confidence: Cost-Effective LLM Integration for Reliable and Efficient Visual Question Answering

Mahsa Mozaffari ⋅ Hitesh Sapkota ⋅ Xumin Liu ⋅ Qi Yu

Recent advances in large language models (LLMs) have improved the accuracy of visual question answering (VQA) systems. However, directly applying LLMs to VQA still presents several challenges: (a) suboptimal performance when handling questions from specialized domains, (b) higher computational costs and slower inference speed due to large model sizes, and (c) the absence of a systematic approach to precisely quantify the uncertainty of LLM responses, raising concerns about their reliability in high-stakes tasks. To address these issues, we propose an UNcertainty-aware LLM-Integrated VQA model ($\texttt{Uni-VQA}$). This model facilitates knowledge exchange between the LLM and a calibrated task-specific model (\ie \texttt{TS-VQA}), guided by reliable confidence scores, resulting in improved VQA accuracy, reliability and inference speed. Our framework strategically leverages these confidence scores to manage the interaction between the LLM and $\texttt{TS-VQA}$: the specialized questions are answered by the $\texttt{TS-VQA}$ model, while general knowledge questions are handled by the LLM. For questions requiring both specialized and general knowledge, the $\texttt{TS-VQA}$ provides candidate answers, which the LLM then combines with its internal knowledge to generate a more accurate response. Extensive experiments on VQA datasets demonstrate the theoretically justified advantages of $\texttt{Uni-VQA}$ over using the LLM or $\texttt{TS-VQA}$ alone.

Poster

P4-#3602

VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks

Xinlong Chen ⋅ Yuanxing Zhang ⋅ Yushuo Guan ⋅ Weihong Lin ⋅ Zekun Wang ⋅ Bohan Zeng ⋅ Yang Shi ⋅ Sihan Yang ⋅ Qiang Liu ⋅ Pengfei Wan ⋅ Liang Wang

The "Reason-Then-Respond" paradigm, enhanced by Reinforcement Learning, has shown great promise in advancing Multimodal Large Language Models. However, its application to the video domain has led to specialized models that excel at either question answering (QA) or captioning tasks, but struggle to master both. Naively combining reward signals from these tasks results in mutual performance degradation, which we attribute to a conflict between their opposing task natures. To address this challenge, we propose a novel training framework built upon two intermediate proxy tasks: DarkEventInfer, which presents videos with masked event segments, requiring models to infer the obscured content based on contextual video cues; and MixVidQA, which presents interleaved video sequences composed of two distinct clips, challenging models to isolate and reason about one while disregarding the other. These proxy tasks compel the model to simultaneously develop both holistic, divergent understanding and precise, convergent reasoning capabilities. Embodying this framework, we present VidBridge-R1, the first versatile video reasoning model that effectively bridges the paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within one model, demonstrating the efficacy of our approach in fostering more generalizable and powerful video understanding models. All code, models, and data will be made publicly available.

Poster

P4-#3603

Can Vision-Language Models Answer Face to Face Questions in the Real-World?

Reza Pourreza ⋅ Rishit Dagli ⋅ Apratim Bhattacharyya ⋅ Sunny Panchal ⋅ Guillaume Berger ⋅ Roland Memisevic

AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources for the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.

Poster

P4-#3604

Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

Jing Zhang ⋅ Zhikai Li ⋅ Xuewen Liu ⋅ Qingyi Gu

Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, the heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration into post-training acceleration. In this paper, we observe that SAM2 exhibits sparse perception pattern as biological vision, which provides opportunities for eliminating redundant computation and acceleration: i) In mask decoder, the attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation to background regions. ii) In memory bank, only a small subset of tokens in each frame contribute significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which promotes SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations, thereby significantly improving inference efficiency. Specifically, for image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers 1.68$\times$ speedup on SAM2.1-L model with only 1.0\% accuracy drop on SA-V test set, where SWR and SMR provide 1.83$\times$ and 1.78$\times$ speedups, respectively.

Poster

P4-#3605

Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking

Shengqiong Wu ⋅ Bobo Li ⋅ Xinkai Wang ⋅ Xiangtai Li ⋅ Lei Cui ⋅ Furu Wei ⋅ Shuicheng YAN ⋅ Hao (Scofield) Fei ⋅ Tat-Seng Chua

Unified Vision–Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing–Drafting problem-solving loop (AD-Loop), a new think paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLMs architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation.

Poster

P4-#3607

MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

Zhenyu Pan ⋅ Han Liu

We present MetaSpatial, the first reinforcement learning (RL) framework for enhancing 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene layout generation without post-processing. MetaSpatial addresses two key challenges: (i) the need for extensive post-processing, as existing VLMs lack inherent 3D spatial reasoning to generate realistic layouts; and (ii) the inefficiency of supervised fine-tuning (SFT) for layout generation due to scarcity of perfect annotations. Our core contribution is the 3D Spatial Policy Optimization (3D-SPO) algorithm, which incorporates physics-aware modulation into advantage estimates at the object level and trajectory-level reward from a training-only multi-turn refinement pipeline. This design enhances temporal credit assignment and encourages spatially consistent policy learning. Empirical evaluations across models of varying scales demonstrate that MetaSpatial improves spatial coherence, physical plausibility, and formatting stability, leading to more realistic and functionally coherent object placements applicable to metaverse environments.

Poster

P4-#3608

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

Rinyoichi Takezoe ⋅ Yaqian Li ⋅ Zi-Hao Bo ⋅ Anzhou Hou ⋅ Mo Guang ⋅ Kaiwen Long

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this issue by pruning unimportant visual tokens, achieving substantial computational reduction while maintaining model performance. The core of token pruning lies in determining token importance, with current approaches primarily relying on attention scores from vision encoders or Large Language Models (LLMs). In this paper, we analyze the effectiveness of attention mechanisms in both vision encoders and LLMs. We find that vision encoders suffer from attention sink, leading to poor focus on informative foreground regions, while in LLMs, although prior studies have identified attention bias toward token positions, text-to-vision attention demonstrates resistance to this bias and enables effective pruning guidance in middle layers. Based on these observations, we propose $\textbf{LearnPruner}$, a two-stage token pruning framework that first removes redundant vision tokens via a learnable pruning module after the vision encoder, then retains only task-relevant tokens in the LLM's middle layer. Experimental results show that our LearnPruner can preserve approximately 95\% of the original performance while using only 5.5\% of vision tokens, and achieve 3.2$\times$ inference acceleration, demonstrating a superior accuracy-efficiency trade-off.

Poster

P4-#3609

LINK: Learning Instance-level Knowledge from Vision-Language Models for Human-Object Interaction Detection

Eastman Z Y Wu ⋅ Ya-Li Li ⋅ Yuan Wang ⋅ Shengjin Wang

Human-Object Interaction (HOI) detection with vision-language models (VLMs) has progressed rapidly, yet a trade-off persists between specialization and generalization. Two major challenges remain: (1) the sparsity of supervision, which hampers effective transfer of foundation models to HOI tasks, and (2) the absence of a generalizable architecture that can excel in both fully supervised and zero-shot scenarios. To address these issues, we propose \textbf{LINK}, \textbf{L}earning \textbf{IN}stance-level \textbf{K}nowledge from VLMs. First, we introduce a HOI detection framework equipped with a Human-Object Geometrical Encoder and a VLM Linking Decoder. By decoupling from detector-specific features, our design ensures plug-and-play compatibility with arbitrary object detectors and consistent adaptability across diverse settings. Building on this foundation, we develop a Progressive Learning Strategy under a teacher-student paradigm, which delivers dense supervision over all potential human-object pairs. By contrasting subtle spatial and semantic differences between positive and negative instances, the model learns robust and transferable HOI representations. LINK sets new state-of-the-art on SWiG-HOI, HICO-DET, and V-COCO across zero-shot, fully supervised, and open-vocabulary settings, with strong generalization to unseen interaction categories.

Poster

P4-#3610

Point2RBox-v3: Self-Bootstrapping from Point Annotations via Integrated Pseudo-Label Refinement and Utilization

Teng Zhang ⋅ Ziqian Fan ⋅ Mingxin Liu ⋅ Xin Zhang ⋅ Xudong Lu ⋅ Wentong Li ⋅ Yue Zhou ⋅ Yi Yu ⋅ Xiang Li ⋅ Junchi Yan ⋅ Xue Yang

Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: $\textbf{1) Progressive Label Assignment (PLA)}$. It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. $\textbf{2) Prior-Guided Dynamic Mask Loss (PGDM-Loss)}$. It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2, which overcomes the shortcomings of Watershed in its poor performance in sparse scenes and SAM's poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the advantages of SAM model with the watershed algorithm, which achieves excellent performance in both sparse and dense scenes. Our solution gives competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09\%/56.86\%/41.28\%/46.40\%/19.60\%/45.96\% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.

Poster

P4-#3611

Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

Tianren Ma ⋅ Mu Zhang ⋅ Yibing Wang ⋅ Qixiang Ye

Optimizing discrete diffusion model (DDM) with rewards remains a challenge—the non-autoregressive paradigm makes importance sampling intractable and rollout complex, puzzling reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then delicately tailored the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. Across math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, doubling reinforcement learning gains while speeding up training by up to 30%. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical way for discretized visual diffusion. The code is available at https://github.com/martian422/MaskGRPO.

Poster

P4-#3612

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

David Ma ⋅ Huaqing Yuan ⋅ Xingjian Wang ⋅ Qianbo ZANG ⋅ Tianci Liu ⋅ Xinyang He ⋅ Yanbin Wei ⋅ Jiawei Guo ⋅ nijiahui ⋅ Zhenzhu Yang ⋅ Meng Cao ⋅ Shanghaoran Quan ⋅ Yizhi Li ⋅ Wangchunshu Zhou ⋅ JIAHENG LIU ⋅ Wenhao Huang ⋅ Ge Zhang ⋅ Shiwen Ni ⋅ Xiaojie Jin

Although long-video understanding demands that models capture hierarchical temporal information—from clip and shot to event and story—existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales\textemdash clip, shot, event, and story\textemdash all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, with 4–8 carefully designed questions, with at least one question targeting each timescale. Evaluating 23 MLLMs reveals a distinct U-shaped performance trend: higher accuracy at the shortest (clip) and longest (story) timescales, with a dip at intermediate levels. Furthermore, ablation studies demonstrate that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a crucial fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at \url{https://github.com/multimodal-art-projection/ScaleLong}

Poster

P4-#3613

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li ⋅ Yingxiu Zhao ⋅ Jiwen Zhang ⋅ Siyuan Wang ⋅ Yang Yao ⋅ Runzhou Zhao ⋅ Jun Song ⋅ Bo Zheng ⋅ zhongyu wei

Current visual reasoning methods mainly focus on exploring specific reasoning modes. Although improvements can be achieved in particular domains, they struggle to develop general reasoning capabilities. Inspired by this, we propose a novel adaptive reasoning paradigm, $\underline{\text{M}}$ixture-$\underline{\text{o}}$f-$\underline{\text{V}}$isual-$\underline{\text{T}}$houghts (**MoVT**), which unifies different reasoning modes within a single model and guides it to select the appropriate mode based on context. To achieve this, we introduce **AdaVaR**, a two-stage $\underline{\text{Ada}}$ptive $\underline{\text{V}}$isu$\underline{\text{a}}$l $\underline{\text{R}}$easoning learning framework: different modes are unified and learned during the supervised cold-start stage, and the mode selection capability is induced via an RL process with a carefully designed AdaGRPO algorithm. Extensive experiments show that AdaVaR effectively guides the model to learn and differentiate multiple modes and perform context-adaptive mode selection, achieving consistent improvement across various scenarios, highlighting MoVT as an effective solution for building general visual reasoning models.

Poster

P4-#3614

Fostering Video Reasoning via Next-Event Prediction

Haonan Wang ⋅ Hongfu Liu ⋅ Xiangyan Liu ⋅ Chao Du ⋅ Kenji Kawaguchi ⋅ Ye Wang ⋅ Tianyu Pang

Next-token prediction serves as the foundational learning task that enables reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video captioning primarily promote modality alignment, while video question answering typically relies on annotations from humans or much stronger MLLMs. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts events in the future, thereby encouraging the model to reason temporally in order to complete the task. To study this learning task, we curate V1-33K, a dataset comprising 33,000 automatically extracted videos spanning diverse real-world scenarios. Using the same videos, we further explore a range of video instruction-tuning tasks data to provide controlled comparisons and isolate the effect of NEP. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training task for fostering temporal reasoning in MLLMs.

Poster

P4-#3615

UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

Hao Tang ⋅ Chen-Wei Xie ⋅ Xiaoyi Bao ⋅ Tingyu Weng ⋅ Pandeng Li ⋅ Yun Zheng ⋅ Liwei Wang

In this paper, we propose UniLIP, a unified framework that adapts CLIP for multimodal understanding, generation and editing. Although CLIP excels at understanding, it lacks reconstruction abilities required to be a unified visual encoder. However, previous CLIP-based unified methods fail to balance understanding and reconstruction, leading to semantic degradation or inconsistent reconstructions. In contrast, we introduce a novel two-stage training scheme with a self-distillation strategy that progressively endows CLIP with high-fidelity reconstruction abilities while preserving its original comprehension performance. For enhanced reasoning and consistency in generation and editing, we further develop a dual-condition architecture built upon the MetaQuery framework. Our architecture jointly utilizes multimodal hidden states for rich contextual details and learnable query embeddings to harness the powerful reasoning abilities of Multimodal Large Language Models (MLLMs). Leveraging advanced image representation and architectural design, UniLIP demonstrates superior instruction following and edit fidelity. With only 1B and 3B parameters, UniLIP can outperform larger unified models such as BAGEL (7B) and Uniworld-V1 (12B), achieving state-of-the-art performance of 0.90 on GenEval, 0.63 on WISE, and 3.94 on ImgEdit. These results demonstrate that UniLIP successfully expands the application of CLIP, establishing its continuous features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks. Code and models are available at https://github.com/nnnth/UniLIP.

Poster

P4-#3616

TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

Guangyi Han ⋅ Wei Zhai ⋅ Yuhang Yang ⋅ Yang Cao ⋅ Zheng-Jun Zha

Hand-object interaction (HOI) is fundamental for humans to express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or generic intent instructions, even when expressed through elaborate language. Such an overly general conditioning imposes a strong inductive bias for stable grasps, thus failing to capture the diversity of daily HOI. To address these limitations, we introduce $\textbf{Free-Form HOI Generation}$, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions, like pushing, poking, and rotating. To support this task, we construct $\textbf{WildO2}$, an in-the-wild diverse 3D HOI dataset, which includes diverse HOI derived from internet videos. Specifically, it contains 4.4k unique interactions across 92 intents and 403 object categories, each with detailed semantic annotations. Building on this dataset, we propose $\textbf{TOUCH}$, a three-stage framework centered on a multi-level diffusion model that facilitates fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities.

Poster

P4-#3617

VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Yuanxin Liu ⋅ Kun Ouyang ⋅ Haoning Wu ⋅ Yi Liu ⋅ Lin Sui ⋅ Xinhao Li ⋅ Yan Zhong ⋅ Y.Charles ⋅ Xinyu Zhou ⋅ Xu Sun

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit is yet to be demonstrated in the domain of video understanding, since most existing benchmarks lack the reasoning depth required to demonstrate the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, the tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is only visible in part of the video. The questions evaluate three escalating levels of video reasoning skills: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under such task setting, models have to precisely recall multiple operations in the video, and perform step-by-step reasoning to get correct final answers for these questions. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs), finding that most perform poorly on complex video reasoning—e.g., GPT-4o achieves only 6.9\% accuracy—while the thinking-enhanced Gemini-2.5-Pro significantly outperforms others with 56.0% accuracy. Our investigations on "test-time scaling" further reveal that extended thinking budget, while offering none or minimal benefits on existing video benchmarks, is essential for improving the performance on VideoReasonBench.

Poster

P4-#3216

GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation

Weijia Dou ⋅ Xu Zhang ⋅ Yi Bin ⋅ Jian Liu ⋅ Bo Peng ⋅ Guoqing Wang ⋅ Yang Yang ⋅ Heng Tao Shen

Recent attempts to transfer features from 2D Vision–Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale, annotated 3D data. We argue that this limitation stems from the dominant \textit{segmentation-and-matching} paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy and view-aggregated features. To exploit this property, we propose \textbf{GeoPurify} that applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure the semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only \textbf{$\sim$1.5\%} of the training data.

Poster

P4-#3618

Sequential Information Bottleneck Fusion: Towards Robust and Generalizable Multi-Modal Brain Tumor Segmentation

TIANYI LIU ⋅ Xi Yang ⋅ Wei Wang ⋅ Anh Nguyen ⋅ Haochuan Jiang ⋅ Kaizhu Huang

Brain tumor segmentation in multi-modal MRIs poses significant challenges when one or more modalities are missing. Recent approaches commonly employ parallel fusion strategies; however, these methods often risk losing crucial shared information across modalities, which can degrade segmentation performance. In this paper, we advocate leveraging sequential information bottleneck fusion to effectively preserve shared information across modalities. From an information-theoretic perspective, sequential fusion not only produces more robust fused representations in missing-data scenarios but also achieves a tighter generalization upper bound compared to parallel fusion approaches. Building on this principle, we propose the Sequential Multi-modal Segmentation Network (SMSN), which integrates an Information-Bottleneck Fusion Module (IBFM). The IBFM sequentially extracts modality-common features while reconstructing modality-specific features through a dedicated feature extraction module. Extensive experiments on the BRATS18 and BRATS20 glioma datasets demonstrate that SMSN consistently outperforms traditional parallel fusion-based baselines, achieving exceptional robustness in diverse missing-modality settings. Furthermore, SMSN exhibits superior cross-domain generalization, as evidenced by its ability to transfer a trained model from BRATS20 to a brain metastasis dataset without fine-tuning. To ensure reproducibility, the code of the SMSN is provided in the supplementary file.

Poster

P4-#3718

Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation

Jinpeng Lu ⋅ Linghan Cai ⋅ Yinda Chen ⋅ Guo Tang ⋅ Songhan Jiang ⋅ Haoyuan Shi ⋅ Zhiwei Xiong

Lightweight 3D medical image segmentation remains constrained by a fundamental "efficiency / robustness conflict", particularly when processing complex anatomical structures and heterogeneous modalities. In this paper, we study how to redesign the framework based on the characteristics of high-dimensional 3D images, and explore data synergy to overcome the fragile representation of lightweight methods. Our approach, VeloxSeg, begins with a deployable and extensible dual-stream CNN-Transformer architecture composed of Paired Window Attention (PWA) and Johnson-Lindenstrauss lemma-guided convolution (JLC). For each 3D image, we invoke a "glance-and-focus" principle, where PWA rapidly retrieves multi-scale information, and JLC ensures robust local feature extraction with minimal parameters, significantly enhancing the model's ability to operate with low computational budget. Followed by an extension of the dual-stream architecture that incorporates modal interaction into the multi-scale image-retrieval process, VeloxSeg efficiently models heterogeneous modalities. Finally, Spatially Decoupled Knowledge Transfer (SDKT) via Gram matrices injects the texture prior extracted by a self-supervised network into the segmentation network, yielding stronger representations than baselines at no extra inference cost. Experimental results on multimodal benchmarks show that VeloxSeg achieves a 26\% Dice improvement, alongside increasing GPU throughput by 11x, CPU by 48x, and reducing training peak GPU memory usage by 1/20, inference by 1/24. Code is available at https://github.com/JinPLu/VeloxSeg.

Poster

P4-#3717

Interaction-aware Representation Modeling With Co-Occurrence Consistency for Egocentric Hand-Object Parsing

Yuejiao Su ⋅ Yi Wang ⋅ Lei Yao ⋅ Yawen Cui ⋅ Lap-Pui Chau

A fine-grained understanding of egocentric human-environment interactions is crucial for developing next-generation embodied agents. One fundamental challenge in this area involves accurately parsing hands and active objects. While transformer-based architectures have demonstrated considerable potential for such tasks, several key limitations remain unaddressed: 1) existing query initialization mechanisms rely primarily on semantic cues or learnable parameters, demonstrating limited adaptability to changing active objects across varying input scenes; 2) previous transformer-based methods utilize pixel-level semantic features to iteratively refine queries during mask generation, which may introduce interaction-irrelevant content into the final embeddings; and 3) prevailing models are susceptible to ``interaction illusion'', producing physically inconsistent predictions. To address these issues, we propose an end-to-end Interaction-aware Transformer (InterFormer), which integrates three key components, i.e., a Dynamic Query Generator (DQG), a Dual-context Feature Selector (DFS), and the Conditional Co-occurrence (CoCo) loss. The DQG explicitly grounds query initialization in the spatial dynamics of hand-object contact, enabling targeted generation of interaction-aware queries for hands and various active objects. The DFS fuses coarse interactive cues with semantic features, thereby suppressing interaction-irrelevant noise and emphasizing the learning of interactive relationships. The CoCo loss incorporates hand-object relationship constraints to enhance physical consistency in prediction. Our model achieves state-of-the-art performance on both the EgoHOS and the challenging out-of-distribution mini-HOI4D datasets, demonstrating its effectiveness and strong generalization ability. Code and models are publicly available at https://github.com/yuggiehk/InterFormer.

Poster

P4-#3716

TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

Sanghyun Jo ⋅ Ziseok Lee ⋅ Wooyeol Lee ⋅ Jonghyun Choi ⋅ Jaesik Park ⋅ Kyungsu Kim

High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain constrained by semantic backbone constraints and human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81× faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation. Project Page: https://shjo-april.github.io/TRACE.

Poster

P4-#3606

Sequential Parallel Duality in Prefix Scannable Models

Morris Yau ⋅ Sharut Gupta ⋅ Valerie Engelmayer ⋅ Kazuki Irie ⋅ Stefanie Jegelka ⋅ Jacob Andreas

Modern neural sequence models are designed to meet the dual mandate of parallelizable training and fast sequential inference. Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba, that achieve such ``sequential-parallel duality.'' This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? We begin by describing a broad class of such models -- state space models -- as those whose state updates can be computed using the classic parallel prefix scan algorithm with a custom associative aggregation operator. We then define a more general class, Prefix-Scannable Models (PSMs), by relaxing the state aggregation operator to allow arbitrary (potentially non-associative) functions such as softmax attention. This generalization unifies many existing architectures, including element-wise RNNs (e.g., Mamba) and linear transformers (e.g., GLA, Mamba2, mLSTM), while also introducing new models with softmax-like operators that achieve O(1) amortized compute per token and log(N) memory for sequence length N. We empirically evaluate such models on illustrative small-scale language modeling and canonical synthetic tasks, including state tracking and associative recall. Empirically, we find that PSMs retain the expressivity of transformer-based architectures while matching the inference efficiency of state space models -- in some cases exhibiting better length generalization than either.

Poster

P4-#3715

All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning

Zheng Yang ⋅ Ruoxin Chen ⋅ Zhiyuan Yan ⋅ Ke-Yue Zhang ⋅ Xinghe Fu ⋅ Shuang Wu ⋅ Xiujun Shu ⋅ Taiping Yao ⋅ Shouhong Ding ⋅ Zequn Qin ⋅ Xi Li

The rapid proliferation of AI-generated images (AIGIs) highlights the pressing demand for generalizable detection methods. In this paper, we establish two key principles for AIGI detection task through systematic analysis: (1) All Patches Matter, since the uniform generation process ensures that each patch inherently contains synthetic artifacts, making every patch a valuable detection source; and (2) More Patches Better, as leveraging distributed artifacts across more patches improves robustness by reducing over-reliance on specific regions. However, counterfactual analysis uncovers a critical weakness: naively trained detectors display Few-Patch Bias, relying disproportionately on minority patches. We identify this bias to Lazy Learner effect, where detectors to limited patch artifacts while neglecting distributed cues. To address this, we propose Panoptic Patch Learning framework, which integrates: (1) Randomized Patch Reconstruction, injecting synthetic cues into randomly selected patches to diversify artifact recognition; (2) Patch-wise Contrastive Learning, enforcing consistent discriminative capability across patches to ensure their uniform utilization. Extensive experiments demonstrate that PPL enhances generalization and robustness across datasets.

Poster

P4-#3714

HSIC Bottleneck for Cross-Generator and Domain-Incremental Synthetic Image Detection

CHIN-CHIA YANG ⋅ Yung-Yu Chuang ⋅ Hwann-Tzong Chen ⋅ Tyng-Luh Liu

Synthetic image generators evolve rapidly, challenging detectors to generalize across current methods and adapt to new ones. We study domain-incremental synthetic image detection with a two-phase evaluation. Phase I trains on either diffusion- or GAN-based data and tests on the combined group to quantify bidirectional cross-generator transfer. Phase II sequentially introduces renders from 3D Gaussian Splatting (3DGS) head avatar pipelines, requiring adaptation while preserving earlier performance. We observe that CLIP-based detectors inherit text-image alignment semantics that are irrelevant to authenticity and hinder generalization. We introduce a Hilbert-Schmidt Independence Criterion (HSIC) bottleneck loss on intermediate CLIP ViT features, encouraging representations predictive of real versus synthetic while independent of generator identity and caption alignment. For domain-incremental learning, we propose HSIC-Guided Replay (HGR), which selects per-class exemplars via a hybrid score combining HSIC relevance with k-center coverage, yielding compact memories that mitigate forgetting. Empirically, the HSIC bottleneck improves transfer between diffusion and GAN families, and HGR sustains prior accuracy while adapting to 3DGS renders. These results underscore the value of information-theoretic feature shaping and principled replay for resilient detection under shifting generative regimes.

Poster

P4-#3713

Part-level Semantic-guided Contrastive Learning for Fine-grained Visual Classification

Zhijian Lin ⋅ Hong Han

Fine-Grained Visual Classification (FGVC) aims to distinguish visually similar subcategories within a broad category, and poses significant challenges due to subtle inter-class differences, large intra-class variations, and data scarcity. Existing methods often struggle to effectively capture both part-level detail and spatial relational features, particularly across rigid and non-rigid object categories. To address these issues, we propose Part-level Semantic-guided Contrastive Learning (PSCL), a novel framework that integrates three key components. (1) The Part Localization Module (PLM) leverages clearCLIP to enable text-controllable region selection, achieving decoupled and semantically guided spatial feature extraction. (2) The Multi-scale Multi-part Branch Progressive Reasoning (MMBPR) module captures discriminative features across multiple parts and scales, while reducing inter-branch redundancy. (3) The Visual-Language Contrastive Learning based on Multi-grained Text Features (VLCL-MG) module introduces intermediate-granularity category concepts to improve feature alignment and inter-class separability. Extensive experiments on five publicly available FGVC datasets demonstrate the superior performance and generalization ability of PSCL, validating the effectiveness of its modular design and the synergy between vision and language. Code is available at: https://github.com/joker-lin9/PSCL

Poster

P4-#3712

No Pixel Left Behind: A Detail-Preserving Architecture for Robust High-Resolution AI-Generated Image Detection

Lianrui Mu ⋅ Haoji Hu ⋅ Xingze Zou ⋅ Jianhong Bai ⋅ Jiaqi Hu

The rapid growth of high-resolution, meticulously crafted AI-generated images poses a significant challenge to existing detection methods, which are often trained and evaluated on low-resolution, automatically generated datasets that do not align with the complexities of high-resolution scenarios. A common practice is to resize or center-crop high-resolution images to fit standard network inputs. However, without full coverage of all pixels, such strategies risk either obscuring subtle, high-frequency artifacts or discarding information from uncovered regions, leading to input information loss. In this paper, we introduce the High-Resolution Detail-Aggregation Network (HiDA-Net), a novel framework that ensures no pixel is left behind. We use the Feature Aggregation Module (FAM), which fuses features from multiple full-resolution local tiles with a down-sampled global view of the image. These local features are aggregated and fused with global representations for final prediction, ensuring that native-resolution details are preserved and utilized for detection. To enhance robustness against challenges such as localized AI manipulations and compression, we introduce Token-wise Forgery Localization (TFL) module for fine-grained spatial sensitivity and JPEG Quality Factor Estimation (QFE) module to disentangle generative artifacts from compression noise explicitly. Furthermore, to facilitate future research, we introduce HiRes-50K, a new challenging benchmark consisting of 50,568 images with up to 64 megapixels. Extensive experiments show that HiDA-Net achieves state-of-the-art, increasing accuracy by over 13% on the challenging Chameleon dataset and 8% on our HiRes-50K.

Poster

P4-#3711

RestoreVAR: Visual Autoregressive Generation for All-in-One Image Restoration

Sudarshan Rajagopalan ⋅ Kartik Narayan ⋅ Vishal Patel

The use of latent diffusion models (LDMs) such as Stable Diffusion has significantly improved the perceptual quality of All-in-One image Restoration (AiOR) methods, while also enhancing their generalization capabilities. However, these LDM-based frameworks suffer from slow inference due to their iterative denoising process, rendering them impractical for time-sensitive applications. Visual autoregressive modeling (VAR), a recently introduced approach for image generation, performs scale-space autoregression and achieves comparable performance to that of state-of-the-art diffusion transformers with drastically reduced computational costs. Moreover, our analysis reveals that coarse scales in VAR primarily capture degradations while finer scales encode scene detail, simplifying the restoration process. Motivated by this, we propose RestoreVAR, a novel VAR-based generative approach for AiOR that significantly outperforms LDM-based models in restoration performance while achieving over $\mathbf{10\times}$ faster inference. To optimally exploit the advantages of VAR for AiOR, we propose architectural modifications and improvements, including intricately designed cross-attention mechanisms and a latent-space refinement module, tailored for the AiOR task. Extensive experiments show that RestoreVAR achieves state-of-the-art performance among generative AiOR methods, while also exhibiting strong generalization capabilities.

Poster

P4-#3710

KinemaDiff: Towards Diffusion for Coherent and Physically Plausible Human Motion Prediction

Ye Lu ⋅ Jie Wang ⋅ Tianyi Liu ⋅ Jianjun Gao ⋅ Kim-Hui Yap

Stochastic Human Motion Prediction (HMP) has become an essential task for the realm of computer vision, for its capacity to anticipate accurate and diverse future human trajectories. Current diffusion-based techniques typically enforce skeletal consistency by encoding structural priors into network architectures. Although effective in promoting plausible kinematics, this approach provides only indirect control over the generative process and often fails to guarantee strict physical constraint satisfaction. In this work, we propose a structure-aligned and joint-aware diffusion framework that enforces physical constraints by embedding skeletal topology and joint-specific dynamics directly into the diffusion process. Specifically, our framework consists of two key modules, the Joint-Adaptive Noise Generator and the Structure-Aligned Constraint Enforcer. The former component, Joint-Adaptive Noise Generator, infers joint-specific dynamics and injects heterogeneous, instance-aware noise per joint and sample to capture spatial variability and enhance motion diversity. The latter component, Structure-Aligned Constraint Enforcer, encodes skeletal topology by modeling joint connectivity and bone lengths from historical motions, and it constrains each denoising step to preserve anatomical consistency. Through their synergistic operation, these modules grant KinemaDiff direct control over physical realism and motion diversity, addressing the common limitations of indirect structural priors and uniform noise application. Extensive experiments on multiple benchmarks demonstrate the effectiveness of our method, attributable to tailoring the diffusion process through structural alignment and joint-adaptive noise modeling.

Poster

P4-#3709

LucidFlux: Caption-Free Universal Image Restoration via a Large-Scale Diffusion Transformer

Song Fei ⋅ Tian Ye ⋅ Lujia Wang ⋅ Lei Zhu

Universal image restoration (UIR) aims to recover images degraded by unknown mixtures while preserving semantics—conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free UIR framework that adapts a large diffusion transformer (Flux.1) to restoration with minimal parameter overhead. LucidFlux introduces a lightweight \emph{dual-branch conditioner} that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. A timestep- and layer-adaptive modulation schedule routes these cues across the backbone’s hierarchy, yielding coarse-to-fine, context-aware updates that protect global structure while recovering texture. To avoid the latency and instability of text prompts or VLM captions, we enforce \emph{caption-free semantic alignment} via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently surpasses strong open-source and commercial baselines across seven metrics, with clear visual gains in realism, detail, and artifact suppression. Ablations confirm that, for large DiTs, when, where, and what to condition—rather than scaling parameters or relying on text prompts—is the key lever for robust, prompt-free restoration.

Poster

P4-#3708

FARTrack: Fast Autoregressive Visual Tracking with High Performance

Guijie Wang ⋅ Tong Lin ⋅ Yifan Bai ⋅ Anjia Cao ⋅ Shiyi Liang ⋅ Wangbo Zhao ⋅ Xing Wei

Inference speed and tracking performance are two critical evaluation metrics in the field of visual tracking. However, high-performance trackers often suffer from slow processing speeds, making them impractical for deployment on resource-constrained devices. To alleviate this issue, we propose $\textbf{FARTrack}$, a $\textbf{F}$ast $\textbf{A}$uto-$\textbf{R}$egressive $\textbf{T}$racking framework. Since autoregression emphasizes the temporal nature of the trajectory sequence, it can maintain high performance while achieving efficient execution across various devices. FARTrack introduces $\textbf{Task-Specific Self-Distillation}$ and $\textbf{Inter-frame Autoregressive Sparsification}$, designed from the perspectives of $\textbf{shallow-yet-accurate distillation}$ and $\textbf{redundant-to-essential token optimization}$, respectively. Task-Specific Self-Distillation achieves model compression by distilling task-specific tokens layer by layer, enhancing the model's inference speed while avoiding suboptimal manual teacher-student layer pairs assignments. Meanwhile, Inter-frame Autoregressive Sparsification sequentially condenses multiple templates, avoiding additional runtime overhead while learning a temporally-global optimal sparsification strategy. FARTrack demonstrates outstanding speed and competitive performance. It delivers an AO of 70.6\% on GOT-10k in real-time. Beyond, our fastest model achieves a speed of 343 FPS on the GPU and 121 FPS on the CPU. Source code is available at: https://github.com/MIV-XJTU/FARTrack.git

Poster

P4-#3707

QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture

Cuong Le ⋅ Pavlo Melnyk ⋅ Urs Waldmann ⋅ Mårten Wadenbäck ⋅ Bastian Wandt

Vision-based 3D human motion capture from videos remains a challenge in computer vision. Traditional 3D pose estimation approaches often ignore the temporal consistency between frames, causing implausible and jittery motion. The emerging field of kinematics-based 3D motion capture addresses these issues by estimating the temporal transitioning between poses instead. A major drawback in current kinematics approaches is their reliance on Euler angles. Despite their simplicity, Euler angles suffer from discontinuity that leads to unstable motion reconstructions, especially in online settings where trajectory refinement is unavailable. Contrarily, quaternions have no discontinuity and can produce continuous transitions between poses. In this paper, we propose QuaMo, a novel Quaternion Motions method using quaternion differential equations (QDE) for human kinematics capture. We utilize the state-space model, an effective system for describing real-time kinematics estimations, with quaternion state and the QDE describing quaternion velocity. The corresponding angular acceleration are computed from a meta-PD controller with a novel acceleration enhancement that adaptively regulates the control signals as the human quickly change to new pose. Unlike previous work, our QDE is solved under the quaternion geometric constraints that results in more accurate estimations. Experimental results show that our novel formulation of the QDE with acceleration enhancement accurately estimates 3D human kinematics with no discontinuity and minimal implausible artifact. QuaMo outperforms comparable state-of-the-art methods on multiple datasets, namely Human3.6M, Fit3D, SportsPose and a subset of AIST. The code is available at https://github.com/cuongle1206/QuaMo

Poster

P4-#3706

Fore-Mamba3D: Mamba-based Foreground-Enhanced Encoding for 3D Object Detection

Zhiwei Ning ⋅ Xuanang Gao ⋅ Jiaxi Cao ⋅ Runze Yang ⋅ Huiying Xu ⋅ Xinzhong Zhu ⋅ Jie Yang ⋅ Wei Liu

Linear modeling methods like Mamba have been merged as the effective backbone for the 3D object detection task. However, previous Mamba-based methods utilize the bidirectional encoding for the whole non-empty voxel sequence, which contains abundant useless background information in the scenes. Though directly encoding foreground voxels appears to be a plausible solution, it tends to degrade detection performance. We attribute this to the response attenuation and restricted context representation in the linear modeling for fore-only sequences. To address this problem, we propose a novel backbone, termed Fore-Mamba3D, to focus on the foreground enhancement by modifying Mamba-based encoder. The foreground voxels are first sampled according to the predicted scores. Considering the response attenuation existing in the interaction of foreground voxels across different instances, we design a regional-to-global slide window (RGSW) to propagate the information from regional split to the entire sequence. Furthermore, a semantic-assisted and state spatial fusion module (SASFMamba) is proposed to enrich contextual representation by enhancing semantic and geometric awareness within the Mamba model. Our method emphasizes foreground-only encoding and alleviates the distance-based and causal dependencies in the linear autoregression model. The superior performance across various benchmarks demonstrates the effectiveness of Fore-Mamba3D in the 3D object detection task.

Poster

P4-#3705

Video Scene Segmentation with Genre and Duration Signals

Jungu Cho ⋅ Seong Jong Ha ⋅ Hae-Gon Jeon

Video scene segmentation aims to detect semantically coherent boundaries in long-form videos, bridging the gap between low-level visual signals and high-level narrative understanding. However, existing methods primarily rely on visual similarity between adjacent shots, which makes it difficult to accurately identify scene boundaries, especially when semantic transitions do not align with visual changes. In this paper, we propose a novel approach that incorporates production-level metadata, specifically genre conventions and shot duration patterns, into video scene segmentation. Our main contributions are three-fold: (1) we leverage textual genre definitions as semantic priors to guide shot-level representation learning during self-supervised pretraining, enabling better capture of narrative coherence; (2) we introduce a duration-aware anchor selection strategy that prioritizes shorter shots based on empirical duration statistics, improving pseudo-boundary generation quality; (3) we propose a test-time shot splitting strategy that subdivides long shots into segments for improved temporal modeling. Experimental results demonstrate state-of-the-art performance on MovieNet-SSeg and BBC datasets. We introduce MovieChat-SSeg, extending MovieChat-1K with manually annotated scene boundaries across 1,000 videos spanning movies, TV series, and documentaries.

Poster

P4-#3704

CortiLife: A Unified Framework for Cortical Representation Learning across the Lifespan

Pengcheng Xue ⋅ Dong Nie ⋅ Jie Luo ⋅ Daoqiang Zhang ⋅ Xuyun Wen

The human cerebral cortex encodes rich neurobiological information that is essential for understanding brain development, aging, and disease. Although various cortical representation learning methods have been proposed, existing models are typically restricted to stage-specific cohorts and lack generalization across the lifespan. While recent vision-language models offer a promising direction, building a unified framework for cortical representation faces three key challenges: (1) the non-Euclidean manifold structure of cortical surfaces, (2) homogenization of individual folding patterns induced by registration, and (3) distribution shifts of cortical features across the lifespan. To address these issues, we present CortiLife, the first unified vision-language framework for lifespan-aware cortical representation learning. Specifically, CortiLife introduces a surface tokenizer that integrates icosahedron-based surface patchification with multi-level patch encoding to transform complex cortical manifolds into compact token representations. The multi-level encoding incorporates three complementary streams that capture local topology, global interactions, and patch-wise distributional patterns, effectively mitigating the challenges of homogenization and distribution shifts. Furthermore, CortiLife integrates masked self-distillation with metadata language prompting, embedding information such as age, sex, health status, and attribution type into the text encoder to better capture individual-specific cortical representations while enabling both age-aware and modality-aware modeling. Extensive experiments on downstream tasks, including two encoder-frozen tasks (age prediction and cortical parcellation) and four encoder fine-tuning tasks (brain disorder diagnosis), demonstrate that CortiLife consistently outperforms state-of-the-art baselines across different age stages and modality types, underscoring its effectiveness and generalization ability.

Poster

P4-#3703

RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo

Victor Oei ⋅ Jenny Schmalfuss ⋅ Lukas Mehl ⋅ Madlen Bartsch ⋅ Shashank Agnihotri ⋅ Margret Keuper ⋅ Andreas Bulling ⋅ Andres Bruhn

Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables public two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that robustness varies widely by corruption type and experimentally show that evaluations on RobustSpring indicate real-world robustness. RobustSpring is a new computer vision benchmark that treats robustness as a first-class citizen to foster models that combine accuracy with resilience.

Poster

P4-#3702

Self-Guided Low Light Object Detection Framework

Gwangik Shin ⋅ Jaeha Song ⋅ Soonmin Hwang

Object detection in low-light environments is inherently challenging due to limited contrast and heavy noise, both of which significantly degrade feature representations. In this paper, we propose a novel self-guided low-light object detection framework that effectively addresses these issues without introducing additional parameters or increasing inference time. Our method incorporates a detachable auxiliary pipeline during training, consisting of an image enhancement module and a denoising module, followed by a Fourier-domain fusion block. This pipeline improves the feature representation of the detector's backbone, enhancing its robustness under low-light conditions. Importantly, at inference time, our method incurs no additional computational cost compared to the baseline detector while achieving substantial performance improvements. Extensive experiments on widely used low-light object detection benchmarks, such as DARK FACE and ExDark, demonstrate that our method achieves state-of-the-art performance. Notably, experiments on the nuImages dataset show that our approach can outperform domain adaptation methods—especially when a large domain gap between source and target domains is inevitable in the real-world applications—highlighting its practical effectiveness. Code is available at https://github.com/gw-shin/SGLDet.

Poster

P4-#3701

WAFT: Warping-Alone Field Transforms for Optical Flow

Yihan Wang ⋅ Jia Deng

We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring, Sintel, and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being 1.3-4.1x faster than existing methods that have competitive accuracy (e.g., 1.3x than Flowformer++, 4.1x than CCMR+). Code and model weights are available at https://github.com/princeton-vl/WAFT.

Poster

P4-#3801

Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models

Marc Benedi San Millan ⋅ Angela Dai ⋅ Matthias Niessner

Animation of humanoid characters is essential in various graphics applications, but require significant time and cost to create realistic animations. We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models -- as such video models contain powerful motion information covering a wide variety of human motions. From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh. We then employ an underlying SMPL representation to animate the corresponding 3D mesh according to the video-generated motion, based on our motion optimization. This enables a cost-effective and accessible solution to enable the synthesis of diverse and realistic 4D animations

Poster

P4-#3802

Seeing Through the PRISM: Compound & Controllable Restoration of Scientific Images

Rupa Kurinchi-Vendhan ⋅ Pratyusha Sharma ⋅ Antonio Torralba ⋅ Sara Beery

Scientific and environmental imagery often suffer from complex mixtures of noise related to the sensor and the environment. Existing restoration methods typically remove one degradation at a time, leading to cascading artifacts, overcorrection, or loss of meaningful signal. In scientific applications, restoration must be able to simultaneously handle compound degradations while allowing experts to selectively remove subsets of distortions without erasing important features. To address these challenges, we present PRISM (Precision Restoration with Interpretable Separation of Mixtures). PRISM is a prompted conditional diffusion framework which combines compound-aware supervision over mixed degradations with a weighted contrastive disentanglement objective that aligns primitives and their mixtures in the latent space. This compositional geometry enables high-fidelity joint removal of overlapping distortions while also allowing flexible, targeted fixes through natural language prompts. Across microscopy, wildlife monitoring, remote sensing, and urban weather datasets, PRISM outperforms state-of-the-art baselines on complex compound degradations, including zero-shot mixtures not seen during training. Importantly, we show that selective restoration significantly improves downstream scientific accuracy in several domains over standard ``black-box'' restoration. These results establish PRISM as a generalizable and controllable framework for high-fidelity restoration in domains where scientific utility is a priority.

Poster

P4-#3803

Rethinking Unsupervised Cross-modal Flow Estimation: Learning from Decoupled Optimization and Consistency Constraint

Runmin Zhang ⋅ Jialiang Wang ⋅ Si-Yuan Cao ⋅ Zhu Yu ⋅ Junchen Yu ⋅ Guangyi Zhang ⋅ Hui-liang Shen

This work presents DCFlow, a novel self-supervised cross-modal flow estimation framework that integrates a decoupled optimization strategy and a cross-modal consistency constraint. Unlike previous unsupervised approaches that implicitly learn flow estimation solely from appearance similarity, we introduce a decoupled optimization strategy with task-specific supervision to address modality discrepancy and geometric misalignment distinctly. This is achieved by collaboratively training a modality transfer network and a flow estimation network. To enable reliable motion supervision without ground-truth flow, we propose a geometry-aware data synthesis pipeline combined with an outlier-robust loss. Additionally, we introduce a cross-modal consistency constraint to jointly optimize both networks, significantly improving flow prediction accuracy. For evaluation, we construct a comprehensive cross-modal flow benchmark by repurposing public datasets. Experimental results demonstrate that DCFlow can be integrated with various flow estimation networks and achieves state-of-the-art performance among unsupervised approaches.

Poster

P4-#3804

Hot PATE: Private Aggregation of Distributions for Diverse Tasks

Edith Cohen ⋅ Benjamin Cohen-Wang ⋅ Xin Lyu ⋅ Jelani Nelson ⋅ Tamas Sarlos ⋅ Uri Stemmer

The Private Aggregation of Teacher Ensembles (PATE) framework enables privacy-preserving machine learning by aggregating responses from disjoint subsets of sensitive data. Adaptations of PATE to tasks with inherent output diversity such as text generation, where the desired output is a sample from a distribution, face a core tension: as diversity increases, samples from different teachers are less likely to agree, but lower agreement results in reduced utility for the same privacy requirements. Yet suppressing diversity to artificially increase agreement is undesirable, as it distorts the output of the underlying model, and thus reduces output quality. We propose Hot PATE, a variant of PATE designed for diverse generative settings. We formalize the notion of a diversity-preserving ensemble sampler and introduce an efficient sampler that provably transfers diversity without incurring additional privacy cost. Hot PATE requires only API access to proprietary models and can be used as a drop-in replacement for existing Cold PATE samplers. Our empirical evaluations corroborate and quantify the benefits, showing significant improvements in the privacy–utility trade-off on evaluated in-context learning tasks, both in preserving diversity and in returning relevant responses.

Poster

P4-#3805

Private Rate-Constrained Optimization with Applications to Fair Learning

Mohammad Yaghini ⋅ Tudor Cebere ⋅ Michael Menart ⋅ Aurélien Bellet ⋅ Nicolas Papernot

Many problems in trustworthy ML can be expressed as constraints on prediction rates across subpopulations, including group fairness constraints (demographic parity, equalized odds, etc.). In this work, we study such constrained minimization problems under differential privacy (DP). Standard DP optimization techniques like DP-SGD rely on objectives that decompose over individual examples, enabling per-example gradient clipping and noise addition. Rate constraints, however, depend on aggregate statistics across groups, creating inter-sample dependencies that violate this decomposability. To address this, we develop RaCO-DP, a DP variant of Stochastic Gradient Descent-Ascent (SGDA) that solves the Lagrangian formulation of rate constraint problems. We show that the additional privacy cost of incorporating these constraints reduces to privately estimating a histogram over the mini-batch at each step. We prove convergence of our algorithm through a novel analysis of SGDA that leverages the linear structure of the dual parameter. Empirical results show that our method Pareto-dominates existing private learning approaches under group fairness constraints and also achieves strong privacy–utility–fairness performance on neural networks.

Poster

P4-#3806

Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs

Dong Yan ⋅ Jian Liang ⋅ Ran He ⋅ Tieniu Tan

Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50\% to below 5\% on open-source models. In addition, our approach offers strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs. Our code is available at https://github.com/Jasper-Yan/TRACE-RPS.

Poster

P4-#3807

Pisces: Cryptography-based Private Retrieval-Augmented Generation with Dual-Path Retrieval

Xiaojian Liang ⋅ Lushan Song ⋅ Shishuai Du ⋅ Weicheng Zhu ⋅ Tan Faith ⋅ Jun Jie Sim ⋅ Haibing Jin ⋅ Zhenghao Wu ⋅ Yingting Liu ⋅ Xin Zhang ⋅ Jiangming Yang ⋅ Pu Duan

Retrieval-augmented generation (RAG) enhances the response quality of large language models (LLMs) when handling domain-specific tasks, yet raises significant privacy concerns. This is because both the user query and documents within the knowledge base often contain sensitive or confidential information. To address these concerns, we propose $\texttt{Pisces}$, the first practical cryptography-based RAG framework that supports dual-path retrieval, while protecting both the query and documents. Along the semantic retrieval path, we reduce computation and communication overhead by leveraging a coarse-to-fine strategy. Specifically, a novel oblivious filter is used to privately select a candidate set of documents to reduce the scale of subsequent cosine similarity computations. For the lexical retrieval path, to reduce the overhead of repeatedly invoking labeled PSI, we implement a multi-instance labeled PSI protocol to compute term frequencies for BM25 scoring in a single execution. $\texttt{Pisces}$ can also be integrated with existing privacy-preserving LLM inference frameworks to achieve end-to-end privacy. Experiments demonstrate that $\texttt{Pisces}$ achieves retrieval accuracy comparable to the plaintext baselines, within a 1.87% margin.

Poster

P4-#3808

Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures

Sangyeon Yoon ⋅ Hyesoo Hong ⋅ Wonje Jeung ⋅ Albert No

Machine unlearning aims to remove specific content from trained models while preserving overall performance. However, the phenomenon of benign relearning, in which forgotten information reemerges even from benign fine-tuning data, reveals that existing unlearning methods remain fundamentally fragile. A common explanation attributes this effect to topical relevance, but we find this account insufficient. Through systematic analysis, we demonstrate that syntactic similarity, rather than topicality, is the primary driver: across benchmarks, syntactically similar data consistently trigger recovery even without topical overlap, due to their alignment in representations and gradients with the forgotten content. Motivated by this insight, we introduce syntactic diversification, which paraphrases the original forget queries into heterogeneous structures prior to unlearning. This approach effectively suppresses benign relearning, accelerates forgetting, and substantially alleviates the trade-off between unlearning efficacy and model utility.

Poster

P4-#3809

Curation Leaks: Membership Inference Attacks against Data Curation for Machine Learning

Dariush Wahdany ⋅ Matthew Jagielski ⋅ Adam Dziedzic ⋅ Franziska Boenisch

In machine learning, curation is used to select the most valuable data for improving both model accuracy and computational efficiency. Recently, curation has also been explored as a solution for private machine learning: rather than training directly on sensitive data, which is known to leak information through model predictions, the private data is used only to guide the selection of useful public data. The resulting model is then trained solely on curated public data. It is tempting to assume that such a model is privacy-preserving because it has never seen the private data. Yet, we show that without further protection, curation pipelines can still leak private information. Specifically, we introduce novel attacks against popular curation methods, targeting every major step: the computation of curation scores, the selection of the curated subset, and the final trained model. We demonstrate that each stage reveals information about the private dataset and that even models trained exclusively on curated public data leak membership information about the private data that guided curation. These findings highlight the previously overlooked inherent privacy risks of data curation and show that privacy assessment must extend beyond the training procedure to include the data selection process. Our differentially private adaptations of curation methods effectively mitigate leakage, indicating that formal privacy guarantees for curation are a promising direction.

Poster

P4-#3810

Rethinking LoRA for Privacy-Preserving Federated Learning in Large Models

JIN LIU ⋅ Yinbin Miao ⋅ Ning Xi ⋅ Junkang Liu

Fine-tuning large vision models (LVMs) and large language models (LLMs) under differentially private federated learning (DPFL) is hindered by a fundamental privacy-utility trade-off. Low-Rank Adaptation (LoRA), a promising parameter-efficient fine-tuning (PEFT) method, reduces computational and communication costs by introducing two trainable low-rank matrices while freezing pre-trained weights. However, directly applying LoRA in DPFL settings leads to performance degradation, especially in LVMs. Our analysis reveals three previously underexplored challenges: (1) gradient coupling caused by the simultaneous update of two asymmetric low-rank matrices, (2) compounded noise amplification under differential privacy, and (3) sharpness of the global aggregated model in the parameter space. To address these issues, we propose LA-LoRA (\textbf{L}ocal \textbf{A}lternating \textbf{LoRA}), a novel approach that decouples gradient interactions and aligns update directions across clients to enhance robustness under stringent privacy constraints. Theoretically, LA-LoRA strengthens convergence guarantees in noisy federated environments. Extensive experiments demonstrate that LA-LoRA achieves state-of-the-art (SOTA) performance on Swin Transformer and RoBERTa models, showcasing robustness to DP noise and broad applicability across both LVMs and LLMs. For example, when fine-tuning the Swin-B model on the Tiny-ImageNet dataset under a strict privacy budget ($\epsilon = 1$), LA-LoRA outperforms the best baseline, RoLoRA, by 16.83\% in test accuracy. Code is provided in the Appendix.

Poster

P4-#3811

Information-Theoretic Membership Inference for Granular Quantification of Memorization

Jiashu Tao ⋅ Reza Shokri

Machine learning models are known to leak sensitive information, as they inevitably memorize (parts of) their training data. This risk is amplified for large language models (LLMs), which are trained on massive corpora and therefore create a more urgent need for privacy assessment prior to release. The standard method to quantify privacy is via membership inference attacks, where the state-of-the-art approach is the Robust Membership Inference Attack (RMIA). In this paper, we introduce \textbf{InfoRMIA}, a principled information-theoretic formulation of membership inference that consistently outperforms RMIA across benchmarks while improving computational efficiency. Moving beyond attack performance alone, we show that treating sequence-level membership inference as the gold standard obscures how memorization manifests in LLMs. To address this limitation, we propose a fine-grained memorization assessment framework based on token-level signals, with InfoRMIA serving as its algorithmic backbone. Our approach identifies which tokens within generated outputs are memorized, localizing privacy leakage from sequences to individual tokens. This framework enables more precise analysis of LLM memorization and potentially targeted mitigation strategies such as exact unlearning.

Poster

P4-#3812

WARP: Weight Teleportation for Attack-Resilient Unlearning Protocols

Mohammad M Maheri ⋅ Xavier Cadet ⋅ Peter Chin ⋅ Hamed Haddadi

Approximate machine unlearning aims to efficiently remove the influence of specific data points from a trained model, offering a practical alternative to full retraining. However, it introduces privacy risks: an adversary with access to both the original and unlearned models can exploit their differences for membership inference or data reconstruction. We show these vulnerabilities arise from two factors: large gradient norms of forgotten samples and the close proximity of the unlearned model to the original. To demonstrate their severity, we design unlearning-specific membership inference and reconstruction attacks, showing that several state-of-the-art methods (such as NGP and SCRUB) remain vulnerable. To mitigate this leakage, we introduce WARP, a plug-and-play teleportation defense that leverages neural network symmetries to reduce gradient energy of forgotten samples and increase parameter dispersion while preserving accuracy. This reparameterization hides the signal of forgotten data, making it harder for attackers to distinguish forgotten samples from non-members or to recover them through reconstruction. Across six unlearning algorithms, our approach achieves consistent privacy gains, reducing adversarial advantage by up to 64% in black-box settings and 92% in white-box settings, while maintaining accuracy on retained data. These results highlight teleportation as a general tool for improving privacy in approximate unlearning.

Poster

P4-#3813

Secret-Protected Evolution for Differentially Private Synthetic Text Generation

Tianze Wang ⋅ Zhaoyu Chen ⋅ Jian Du ⋅ Yingtai Xiao ⋅ Linjun Zhang ⋅ Qiang Yan

Text data has become extremely valuable on large language models (LLMs) and even lead to general artificial intelligence (AGI). A lot of high-quality text in the real world is private and cannot be freely used due to privacy concerns. Therefore, differentially private (DP) synthetic text generation has been proposed, aiming to produce high-utility synthetic data while protecting sensitive information. However, existing DP synthetic text generation imposes uniform guarantees that often overprotect non-sensitive content, resulting in substantial utility loss and computational overhead. Therefore, we propose Secret-Protected Evolution (SecPE), a novel framework that extends private evolution with secret-aware protection. Theoretically, we show that SecPE satisfies $(\vp, \vr)$-secret protection, constituting a relaxation of Gaussian DP that enables tighter utility–privacy trade-offs, while also substantially reducing computational complexity relative to baseline methods. Empirically, across the OpenReview, PubMed, and Yelp benchmarks, SecPE consistently achieves lower Fréchet Inception Distance (FID) and higher downstream task accuracy than GDP-based Aug-PE baselines, while requiring less noise to attain the same level of protection. Our results highlight that secret-aware guarantees can unlock more practical and effective privacy-preserving synthetic text generation.

Poster

P4-#3814

Traceable Black-Box Watermarks For Federated Learning

Jiahao Xu ⋅ Rui Hu ⋅ Olivera Kotevska ⋅ Zikai Zhang

Due to the distributed nature of Federated Learning (FL) systems, each local client has access to the global model, which poses a critical risk of model leakage. Existing works have explored injecting watermarks into local models to enable intellectual property protection. However, these methods either focus on non-traceable watermarks or traceable but white-box watermarks. We identify a gap in the literature regarding the formal definition of traceable black-box watermarking and the formulation of the problem of injecting such watermarks into FL systems. In this work, we first formalize the problem of injecting traceable black-box watermarks into FL. Based on the problem, we propose a novel server-side watermarking method, $\mathbf{TraMark}$, which creates a traceable watermarked model for each client, enabling verification of model leakage in black-box settings. To achieve this, $\mathbf{TraMark}$ partitions the model parameter space into two distinct regions: the main task region and the watermarking region. Subsequently, a personalized global model is constructed for each client by aggregating only the main task region while preserving the watermarking region. Each model then learns a unique watermark exclusively within the watermarking region using a distinct watermark dataset before being sent back to the local client. Extensive results across various FL systems demonstrate that $\mathbf{TraMark}$ ensures the traceability of all watermarked models while preserving their main task performance. The code is available at \url{https://github.com/JiiahaoXU/TraMark}.

Poster

P4-#3815

Heterogeneous Federated Fine-Tuning with Parallel One-Rank Adaptation

Zikai Zhang ⋅ Rui Hu ⋅ Jiahao Xu

Large Language Models (LLMs) have demonstrated remarkable effectiveness in adapting to downstream tasks through fine-tuning. Federated Learning (FL) extends this capability by enabling collaborative fine-tuning across distributed clients using Low-Rank Adaptation (LoRA), while preserving data privacy by avoiding raw data sharing. However, practical deployments face challenges when clients have heterogeneous resources and thus adopt different LoRA ranks, leading to substantial initialization and aggregation noise that undermines performance. To address these challenges, we propose Fed-PLoRA, a novel lightweight heterogeneous federated fine-tuning (FFT) framework. Fed-PLoRA introduces Parallel One-Rank Adaptation (PLoRA), a new LoRA variant that replaces the classic multi-rank LoRA module with multiple parallel one-rank modules, and a novel Select-N-Fold strategy that folds untrained PLoRA modules into the pre-trained weights before local training, thereby accommodating heterogeneous client resources. We provide a unified analysis of initialization and aggregation noise of Fed-PLoRA and demonstrate how it addresses the limitations of state-of-the-art methods. Extensive experiments on diverse LLM fine-tuning tasks demonstrate that Fed-PLoRA consistently outperforms existing methods in both accuracy and efficiency. The code is available at \url{https://github.com/TNI-playground/Fed-PLoRA}.

Poster

P4-#3816

LUMINA: Detecting Hallucinations in RAG System with Context–Knowledge Signals

Min-Hsuan Yeh ⋅ Sharon Li ⋅ Tanwi Mallick

Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations in large language models (LLMs) by grounding responses in retrieved documents. Yet, RAG-based LLMs still hallucinate even when provided with correct and sufficient context. A growing line of work suggests that this stems from an imbalance between how models use external context and their internal knowledge, and several approaches have attempted to quantify these signals for hallucination detection. However, existing methods require extensive hyperparameter tuning, limiting their generalizability. We propose LUMINA, a novel framework that detects hallucinations in RAG systems through context--knowledge signals: external context utilization is quantified via distributional distance, while internal knowledge utilization is measured by tracking how predicted tokens evolve across transformer layers. We further introduce a framework for statistically validating these measurements. Experiments on common RAG hallucination benchmarks and four open-source LLMs show that LUMINA achieves consistently high AUROC and AUPRC scores, outperforming prior utilization-based methods by up to +13% AUROC on HalluRAG. Moreover, LUMINA remains robust under relaxed assumptions about retrieval quality and model matching, offering both effectiveness and practicality. LUMINA: https://github.com/deeplearning-wisc/LUMINA

Poster

P4-#3817

CUPID: A Plug-in Framework for Joint Aleatoric and Epistemic Uncertainty Estimation with a Single Model

Xinran Xu ⋅ Xiuyi Fan

Accurate estimation of uncertainty in deep learning is critical for deploying models in high-stakes domains such as medical diagnosis and autonomous decision-making, where overconfident predictions can lead to harmful outcomes. In practice, understanding the reason behind a model’s uncertainty and the type of uncertainty it represents can support risk-aware decisions, enhance user trust, and guide additional data collection. However, many existing methods only address a single type of uncertainty or require modifications and retraining of the base model, making them difficult to adopt in real-world systems. We introduce CUPID (Comprehensive Uncertainty Plug-in estImation moDel), a general-purpose module that jointly estimates aleatoric and epistemic uncertainty without modifying or retraining the base model. CUPID can be flexibly inserted into any layer of a pretrained network. It models aleatoric uncertainty through a learned Bayesian identity mapping and captures epistemic uncertainty by analyzing the model’s internal responses to structured perturbations. We evaluate CUPID across a range of tasks, including classification, regression, and out-of-distribution detection. The results show that it consistently delivers competitive performance while offering layer-wise insights into the origins of uncertainty. By making uncertainty estimation modular, interpretable, and model-agnostic, CUPID supports more transparent and trustworthy AI.

Poster

P4-#3818

SABRE-FL: Selective and Accurate Backdoor Rejection for Federated Prompt Learning

Momin Khan ⋅ Yasra Chandio ⋅ Fatima Anwar

Federated Prompt Learning has emerged as a communication-efficient and privacy-preserving paradigm for adapting large vision-language models like CLIP across decentralized clients. However, the security implications of this setup remain underexplored. In this work, we present the first study of backdoor attacks in Federated Prompt Learning. We show that when malicious clients inject visually imperceptible, learnable noise triggers into input images, the global prompt learner becomes vulnerable to targeted misclassification while still maintaining high accuracy on clean inputs. Motivated by this vulnerability, we propose SABRE-FL, a lightweight, modular defense that filters poisoned prompt updates using an embedding-space anomaly detector trained offline on out-of-distribution data. SABRE-FL requires no access to raw client data or labels and generalizes across diverse datasets. We show, both theoretically and empirically, that malicious clients can be reliably identified and filtered using an embedding-based detector. Across five diverse datasets and four baseline defenses, SABRE-FL outperforms all baselines by significantly reducing backdoor accuracy while preserving clean accuracy, demonstrating strong empirical performance and underscoring the need for robust prompt learning in future federated systems.

Poster

P4-#3918

DynaGuard: A Dynamic Guardian Model With User-Defined Policies

Monte Hoover ⋅ Vatsal Baherwani ⋅ Neel Jain ⋅ Khalid Saifullah ⋅ Joseph J Vincent ⋅ Chirag Jain ⋅ Melissa Rad ⋅ C. Bruss ⋅ Ashwinee Panda ⋅ Tom Goldstein

Guardian models play a crucial role in ensuring the safety and ethical behavior of user-facing AI applications by enforcing guardrails and detecting harmful content. While standard guardian models are limited to predefined, static harm categories, we introduce DynaGuard, a suite of dynamic guardian models offering novel flexibility by evaluating text based on user-defined policies, and DynaBench, a dataset for training and evaluating dynamic guardian models. Our models provide both rapid detection of policy violations and a chain-of-thought reasoning option that articulate and justify model outputs. Critically, DynaGuard not only surpasses static models in detection accuracy on traditional safety categories, but is competitive with frontier reasoning models on free-form policy violations, all in a fraction of the time. This makes DynaGuard an critical tool for language model guardrails.

Poster

P4-#3917

On The Fragility of Benchmark Contamination Detection in Reasoning Models

Han Wang ⋅ Haoyu Li ⋅ Brian Ko ⋅ Huan Zhang

Leaderboards for large reasoning models (LRMs) have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to achieving higher rankings is to incorporate evaluation benchmarks into the training data, thereby yielding inflated performance, known as benchmark contamination. Despite that numerous contamination detection approaches have been proposed, surprisingly, our studies find that evading contamination detections for LRMs is alarmingly easy. We focus on the two scenarios where contamination may occur in practice: (I) when the base model evolves into LRM via supervised fine-tuning (SFT) and reinforcement learning (RL), we find that contamination during SFT can be originally identified by contamination detection methods. Yet, even a brief Group Relative Policy Optimization (GRPO) training can markedly \textbf{conceal contamination signals} that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that Proximal Policy Optimization (PPO) style importance sampling and clipping objectives are the root cause of this detection concealment, indicating that \textbf{a broad class of RL methods} may inherently exhibit similar concealment capability; (II) when SFT contamination with CoT is applied to advanced LRMs as the final stage, most contamination detection methods \textbf{perform near random guesses}. Without exposure to non-members, contaminated LRMs would still have more confidence when responding to those unseen samples that share similar distributions to the training set, and thus, evade existing memorization-based detection methods. Together, our findings reveal the unique vulnerability of LRMs evaluations: Model developers could easily contaminate LRMs to achieve inflated leaderboards performance while leaving minimal traces of contamination, thereby strongly undermining the fairness of evaluation and threatening the integrity of public leaderboards. This underscores the urgent need for advanced contamination detection methods and trustworthy evaluation protocols tailored to LRMs.

Poster

P4-#3916

AtC: Aggregate-then-Calibrate for Human-centered Assessment

Zejun Xie ⋅ Xintong Li ⋅ Guang Wang ⋅ Desheng Zhang

Human-centered assessment tasks, which are essential for systematic decision-making, rely heavily on human judgment and typically lack verifiable ground truth. Existing approaches face a dilemma: methods using only human judgments suffer from heterogeneous expertise and inconsistent rating scales, while methods using only model-generated scores must learn from imperfect proxies or incomplete features. We propose Aggregate-then-Calibrate (AtC), a two-stage framework that combines these complementary sources. Stage-1 aggregates heterogeneous comparative judgments into a consensus ranking $\hat{\pi}$ using a rank-aggregation model that accounts for annotator reliability. Stage-2 calibrates any predictive model’s scores by an isotonic projection onto the order $\hat{\pi}$, enforcing ordinal consistency while preserving as much of the model’s quantitative information as possible. Theoretically, we show: (1) modeling annotator heterogeneity yields strictly more efficient consensus estimation than homogeneity; (2) isotonic calibration enjoys risk bounds even when the consensus ranking is misspecified; and (3) AtC asymptotically outperforms model-only assessment. Across semi-synthetic and real-world datasets, AtC consistently improves accuracy and robustness over human-only or model-only assessments. Our results bridge judgment aggregation with model-free calibration, providing a principled recipe for human-centered assessment when ground truth is costly, scarce, or unverifiable.

Poster

P4-#3915

Weak-to-Strong Generalization with Failure Trajectories

Ruimeng Ye ⋅ Zihan Wang ⋅ Yang Xiao ⋅ Zinan Ling ⋅ Manling Li ⋅ Bo Hui

Weak-to-Strong generalization (W2SG) is a new trend to elicit the full capabilities of a strong model with supervision from a weak model. While existing W2SG studies focus on simple tasks like binary classification, we extend this paradigm to complex interactive decision-making environments. Specifically, we fine-tune a strong model with trajectories of intermediate actions generated by a weak model. Motivated by the human learning process, we propose to generalize not only successful knowledge but also failed experiences so that the strong model can learn from the failed trajectories accumulated by weak models. To effectively and efficiently elicit the potential of strong agents, we further construct ``trajectory trees," a hierarchical representation that organizes weak model-generated action trajectories, coupled with Monte Carlo Tree Search (MCTS) to optimize the strong model. Through theoretical analysis, we provide formal guarantees for the effectiveness of our method in improving W2SG performance. Our empirical evaluations demonstrate substantial improvements in reasoning and decision-making capabilities across diverse task domains, validating the scalability and robustness of our proposed framework. Our code is available at: https://github.com/yeruimeng/TraTree.git.

Poster

P4-#3914

Steering Language Models with Weight Arithmetic

Constanza Fierro ⋅ Fabien Roger

Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight-space by subtracting the weight deltas from two small fine-tunes---one that induces the desired behavior and another that induces its opposite---and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.

Poster

P4-#3913

When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

Xinyu Zhou ⋅ Chang Jin ⋅ Carsten Eickhoff ⋅ Zhijiang Guo ⋅ Seyed Ali Bahrainian

Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering (QA), where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce pipelines including one that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by 3.46% and 5.80% in Exact Match on TimeQA-Easy and -Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by 20% over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study presents new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs. Dataset and code is publicly released https://github.com/Blackzxy/AbstentionTemporalQA.

Poster

P4-#3912

Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment

Mengxuan Hu ⋅ Vivek Datla ⋅ Anoop Kumar ⋅ Zihan Guan ⋅ Sheng Li ⋅ Alfy Samuel ⋅ Daben Liu

Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.

Poster

P4-#3910

What happens when generative AI models train recursively on each others' outputs?

Hung Vu ⋅ Galen Reeves ⋅ Emily Wenger

The internet serves as a common source of training data for generative AI (genAI) models but is increasingly populated with AI-generated content. This duality raises the possibility that future genAI models may be trained on other models' generated outputs. Prior work has studied consequences of models training on their own generated outputs, but limited work has considered what happens if models ingest content produced by other models. Given society's increasing dependence on genAI tools, understanding such data-mediated model interactions is critical. This work provides empirical evidence for how data-mediated interactions might unfold in practice, develops a theoretical model for this interactive training process, and experimentally validates the theory. We find that data-mediated interactions can benefit models by exposing them to novel concepts perhaps missed in original training data, but also can homogenize their performance on shared tasks.

Poster

P4-#3909

Towards Scalable Oversight via Partitioned Human Supervision

Ren Yin ⋅ Takashi Ishida ⋅ Masashi Sugiyama

As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging. Our focus is on tasks that require deep knowledge and skills of multiple domains, where this bottleneck is severe. Unfortunately, even the best human experts are knowledgeable only in a single narrow area, and will not be able to evaluate the correctness of advanced AI systems on such superhuman tasks. However, based on their narrow expertise, humans may provide a weak signal, i.e., a complementary label indicating an option that is incorrect. For example, a cardiologist could state that ''this is not related to any cardiovascular disease,'' even if they cannot identify the true disease. Based on this weak signal, we propose a scalable oversight framework that enables us to evaluate frontier AI systems without the need to prepare the ground truth. We derive an unbiased estimator of top-1 accuracy from complementary labels and quantify how many complementary labels are needed to match the variance of ordinary labels. We further introduce two estimators to combine scarce ordinary labels with abundant complementary labels. We provide finite-sample deviation guarantees for both complementary-only and the mixed estimators. Empirically, we show that we can evaluate the output of large language models without the ground truth, if we have complementary labels. We further show that we can train an AI system with such weak signals: we show how we can design an agentic AI system automatically that can improve itself with this partitioned human supervision. Our code is available at https://github.com/R-Yin-217/Towards-Scalable-Oversight-via-Partitioned-Human-Supervision.

Poster

P4-#3908

OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

Jingdi Lei ⋅ Varun Gumma ⋅ Rishabh Bhardwaj ⋅ Seok Lim ⋅ Chuan Li ⋅ Amir Zadeh ⋅ Soujanya Poria

Large Language Model (LLM) safety is one of the most pressing challenges for enabling wide-scale deployment. While most studies and global discussions focus on generic harms, such as models assisting users in harming themselves or others, enterprises face a more fundamental concern: whether LLM-based agents are safe for their intended use case. To address this, we introduce operational safety, defined as an LLM’s ability to appropriately accept or refuse user queries when tasked with a specific purpose. We further propose OffTopicEval, an evaluation suite and benchmark for measuring operational safety both in general and within specific agentic use cases. Our evaluations on six model families comprising 20 open-weight LLMs reveal that while performance varies across models, all of them remain highly operationally unsafe. Even the strongest models—Qwen-3 (235B) with 77.77% and Mistral (24B) with 79.96%—fall far short of reliable operational safety, while GPT models plateau in the 62–73% range, Phi achieves only mid-level scores (48–70%), and Gemma and Llama-3 collapse to 39.53% and 23.84%, respectively. While operation safety is core model's alignment issue, to suppress these failures, we propose prompt-based steering methods, query grounding (Q-ground), and system-prompt grounding (P-ground), which substantially improve OOD refusal. Q-ground provides consistent gains of up to 23%, while P-ground delivers even larger boosts, raising Llama-3.3 (70B) by 41% and Qwen-3 (30B) by 27%. These results highlight both the urgent need for operational safety interventions and the promise of prompt-based steering as a first step toward more reliable LLM-based agents. Our code and data are released at \url{https://github.com/declare-lab/OffTopicEval}.

Poster

P4-#3907

Do Large Language Models Know What They Are Capable Of?

Casey Barkan ⋅ Sidney Black ⋅ Oliver Sourbut

We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence leading to significantly improved decision making, while others do not. Interestingly, all LLMs’ decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.

Poster

P4-#3906

SERUM: Simple, Efficient, Robust, and Unifying Marking for Diffusion-based Image Generation

Jan Kociszewski ⋅ Hubert Jastrzębski ⋅ Tymoteusz Stępkowski ⋅ Filip Manijak ⋅ Krzysztof Rojek ⋅ Franziska Boenisch ⋅ Adam Dziedzic

We propose SERUM: an intriguingly simple yet highly effective method for marking images generated by diffusion models (DMs). We only add a unique watermark noise to the initial diffusion generation noise and train a lightweight detector to identify watermarked images, simplifying and unifying the strengths of prior approaches. SERUM provides robustness against any image augmentations or watermark removal attacks and is extremely efficient, all while maintaining negligible impact on image quality. In contrast to prior approaches, which are often only resilient to limited perturbations and incur significant training, injection, and detection costs, our SERUM achieves remarkable performance, with the highest true positive rate (TPR) at a 1% false positive rate (FPR) in most scenarios, along with fast injection and detection and low detector training overhead. Its decoupled architecture also seamlessly supports multiple users by embedding individualized watermarks with little interference between the marks. Overall, our method provides a practical solution to mark outputs from DMs and to reliably distinguish generated from natural images.

Poster

P4-#3905

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

Wenbo Pan ⋅ Jie Xu ⋅ Qiguang Chen ⋅ Junhao Dong ⋅ Libo Qin ⋅ Xinfeng Li ⋅ Yu Haining ⋅ Xiaohua Jia

Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability, while existing metrics fail to capture this ability. In this work, we propose the Refusal Index (RI), a novel and principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. RI is practically measurable with a lightweight two-pass evaluation method which only require observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's knowledge-aware refusal capability. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. These properties suggest RI captures a stable, intrinsic aspect of model knowledge calibration. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.

Poster

P4-#3904

A Guardrail for Safety Preservation: When Safety-Sensitive Subspace Meets Harmful-Resistant Null-Space

Bingjie Zhang ⋅ Yibo Yang ⋅ Ren Zhe ⋅ Dandan Guo ⋅ Jindong Gu ⋅ Philip Torr ⋅ Bernard Ghanem

Large language models (LLMs) have achieved remarkable success in diverse tasks, yet their safety alignment remains fragile during adaptation. Even when fine-tuning on benign data or with low-rank adaptation, pre-trained safety behaviors are easily degraded, leading to harmful responses in the fine-tuned models. To address this challenge, we propose GuardSpace, a guardrail framework for preserving safety alignment throughout fine-tuning, composed of two key components: a safety-sensitive subspace and a harmful-resistant null space. First, we explicitly decompose pre-trained weights into safety-relevant and safety-irrelevant components using covariance-preconditioned singular value decomposition, and initialize low-rank adapters from the safety-irrelevant ones, while freezing safety-relevant components to preserve their associated safety mechanism. Second, we construct a null space projector that restricts adapter updates from altering safe outputs on harmful prompts, thereby maintaining the original refusal behavior. Experiments with various pre-trained models on multiple downstream tasks demonstrate that GuardSpace achieves superior performance over existing methods. Notably, for Llama-2-7B-Chat fine-tuned on GSM8K, GuardSpace outperforms the state-of-the-art method AsFT, reducing the average harmful score from 14.4\% to 3.6\%, while improving the accuracy from from 26.0\% to 28.0\%.

Poster

P4-#3903

Guidance Watermarking for Diffusion Models

Enoal Gesny ⋅ Eva Giboulot ⋅ Teddy Furon ⋅ Vivien Chappelier

This paper introduces a novel watermarking method for diffusion models. It is based on guiding the diffusion process using the gradient computed from any off-the-shelf watermark decoder. The gradient is guided further using different image augmentations, increasing robustness to attacks against which the decoder was not originally robust, without retraining or fine-tuning. The methodology effectively allows to convert any post-hoc watermarking scheme into a scheme embedding the signal during the diffusion process. We show that this approach is complementary to watermarking techniques modifying the variational autoencoder at the end of the diffusion process. We validate the methods on different diffusion models and detectors. The watermarking guidance does not significantly alter the generated image for a given seed and prompt, preserving both the diversity and quality of generation.

Poster

P4-#3902

Practical estimation of the optimal classification error with soft labels and calibration

Ryota Ushio ⋅ Takashi Ishida ⋅ Masashi Sugiyama

While the performance of machine learning systems has experienced significant improvement in recent years, relatively little attention has been paid to the fundamental question: to what extent can we improve our models? This paper provides a means of answering this question in the setting of binary classification, which is practical and theoretically supported. We extend a previous work that utilizes soft labels for estimating the Bayes error, the optimal error rate, in two important ways. First, we theoretically investigate the properties of the bias of the hard-label-based estimator discussed in the original work. We reveal that the decay rate of the bias is adaptive to how well the two class-conditional distributions are separated, and it can decay significantly faster than the previous result suggested as the number of hard labels per instance grows. Second, we tackle a more challenging problem setting: estimation with corrupted soft labels. One might be tempted to use calibrated soft labels instead of clean ones. However, we reveal that calibration guarantee is not enough, that is, even perfectly calibrated soft labels can result in a substantially inaccurate estimate. Then, we show that isotonic calibration can provide a statistically consistent estimator under an assumption weaker than that of the previous work. Our method is instance-free, i.e., we do not assume access to any input instances. This feature allows it to be adopted in practical scenarios where the instances are not available due to privacy issues. Experiments with synthetic and real-world datasets show the validity of our methods and theory. The code is available at https://github.com/RyotaUshio/bayes-error-estimation.

Poster

P4-#3901

Dissecting Representation Misalignment in Contrastive Learning via Influence Function

Huanyi Xie ⋅ Chenyang Ren ⋅ Khouloud Saadi ⋅ Shu Yang ⋅ Zhen Tan ⋅ Jingfeng Zhang ⋅ Lijie Hu ⋅ Di Wang

Contrastive learning, commonly applied in large-scale multimodal models, often relies on data from diverse and often unreliable sources, which can include misaligned or mislabeled text-image pairs. This frequently leads to robustness issues and hallucinations, ultimately causing performance degradation. Data valuation is an efficient way to detect and trace these misalignments. Nevertheless, existing methods are computationally expensive for large-scale models. Although computationally efficient, classical influence functions are inadequate for contrastive learning models, as they were initially designed for pointwise loss. Furthermore, contrastive learning involves minimizing the distance between positive sample modalities while maximizing the distance between negative sample modalities. This necessitates evaluating the influence of samples from both perspectives. To tackle these challenges, we introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss. ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models, eliminating the need for retraining. Building upon ECIF, we develop a series of algorithms for data evaluation, misalignment detection, and misprediction trace-back tasks. Experimental results demonstrate that our ECIF advances the transparency and interpretability of CLIP-style embedding models by offering a more accurate assessment of data impact and model alignment compared to traditional baseline methods.

Poster

P4-#4001

Decomposing LLM Computation with Jets

Yihong Chen ⋅ Xiangxiang Xu ⋅ Pontus Stenetorp ⋅ Sebastian Riedel ⋅ Luca Franceschi

Large language models are becoming general knowledge engines for diverse applications. However, their computations are deeply entangled after training, resisting modularization which complicates interpretability, auditing, and long-term maintenance. We introduce Jet Expansions, a framework for expanding computational graphs using jet operators that generalize truncated Taylor series. Our method systematically decomposes language models into explicit input-to-output computational paths and complementary remainders. This functional decomposition provides a principled, knife-like operator for cutting through entanglement in LLMs, enabling scalable model inspection. We demonstrate how Jet Expansions ground and subsume the popular interpretability technique Logit Lens, reveal a (super-)exponential path structure with respect to recursive residual depth, and support several interpretability applications, including sketching a transformer language model with $n$-gram statistics extracted from its computations and indexing model toxicity levels *without* curated benchmarks.

Poster

P4-#4002

The Value of Information in Human-AI Decision-making

Ziyang Guo ⋅ Yifan Wu ⋅ Jason Hartline ⋅ Jessica Hullman

Multiple agents are increasingly combined to make decisions with the expectation of achieving complementary performance, where the decisions they make together outperform those made individually. However, knowing how to improve the performance of collaborating agents requires knowing what information and strategies each agent employs. With a focus on human-AI pairings, we contribute a decision-theoretic framework for characterizing the value of information. By defining complementary information, our approach identifies opportunities for agents to better exploit available information in AI-assisted decision workflows. We present a novel explanation technique (ILIV-SHAP) that adapts SHAP explanations to highlight human-complementing information. We validate the effectiveness of our framework and ILIV-SHAP through a study of human-AI decision-making, and demonstrate the framework on examples from chest X-ray diagnosis and deepfake detection. We find that presenting ILIV-SHAP with AI predictions leads to reliably greater reductions in error over non-AI assisted decisions more than vanilla SHAP.

Poster

P4-#4003

SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

Anjali Parashar ⋅ Yingke Li ⋅ Eric Yu ⋅ Fei Chen ⋅ James Neidhoefer ⋅ devesh upadhyay ⋅ Chuchu Fan

As autonomous systems such as drones, become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate the ethical alignment since failure to do so imposes imminent danger to human lives, and long term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined metrics for evaluation, and stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates domain-specific objective evaluations, and subjective value judgments from stakeholders. SEED-SET models both evaluation types separately with hierarchical Gaussian Processes, and uses a novel acquisition strategy to propose interesting test candidates based on learnt qualitative preferences and objectives that align with the stakeholder preferences. We validate our approach for ethical benchmarking of autonomous agents on two applications and find our method to perform the best. Our method provides an interpretable and efficient trade-off between exploration and exploitation, by generating up to $2\times$ optimal test candidates compared to baselines, with $1.25\times$ improvement in coverage of high dimensional search spaces.

Poster

P4-#4004

Self-Consistency Improves the Trustworthiness of Self-Interpretable GNNs

Wenxin Tai ⋅ Ting Zhong ⋅ Goce Trajcevski ⋅ Fan Zhou

Graph Neural Networks (GNNs) achieve strong predictive performance but offer limited transparency in their decision-making. Self-Interpretable GNNs (SI-GNNs) address this by generating built-in explanations, yet their training objectives are misaligned with evaluation criteria such as faithfulness. This raises two key questions: (i) can faithfulness be explicitly optimized during training, and (ii) does such optimization truly improve explanation quality? We show that faithfulness is intrinsically tied to explanation self-consistency and can therefore be optimized directly. Empirical analysis further reveals that self-inconsistency predominantly occurs on unimportant features, linking it to redundancy-driven explanation inconsistency observed in recent work and suggesting untapped potential for improving explanation quality. Building on these insights, we introduce a simple, model-agnostic self-consistency (SC) fine-tuning strategy. Without changing model architectures, SC consistently improves explanation quality across multiple dimensions and benchmarks, offering an effective and scalable pathway to more trustworthy GNN explanations. Our code is publicly available at \url{https://github.com/ICDM-UESTC/SelfConsistencyXGNN}.

Poster

P4-#4005

Tracing the Traces: Latent Temporal Signals for Efficient and Accurate Reasoning

Martina G. Vilas ⋅ Safoora Yousefi ⋅ Besmira Nushi ⋅ Eric Horvitz ⋅ Vidhisha Balachandran

Reasoning models improve their problem-solving ability through inference-time scaling, allocating more compute via longer and multiple token budgets. Identifying which reasoning traces are likely to succeed remains a key opportunity: reliably predicting productive thinking paths can substantially reduce wasted computation and improve overall efficiency. We introduce Latent-Trajectory signals that characterize the temporal evolution of a model's internal representations during the generation of intermediate reasoning tokens. By analyzing both the extent and temporal course of latent representational change, as well as its alignment with the final state, we show that these signals are strong predictors of solution accuracy, outperforming conventional output-based confidence measures. We use latent-trajectory signals to guide answer selection across multiple sampled generations, demonstrating that they make test-time scaling more effective and efficient, reducing token usage by up to 70% while preserving and even improving accuracy on average in comparison with majority voting. Finally, we show that these signals often emerge early in the reasoning trace, which enables early selection and allocation of compute to the most promising answer candidates during generation. Our findings contribute not only practical strategies for inference-time efficiency, but also a deeper interpretability perspective on how reasoning processes are represented and differentiated in latent space.

Poster

P4-#4006

Mechanism of Task-oriented Information Removal in In-context Learning

Hakaze Cho ⋅ Haolin Yang ⋅ Gouki Minegishi ⋅ Naoya Inoue

In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states containing information for all possible tasks, leading to arbitrary outputs without focusing on the intended task, resulting in near-zero accuracy. Meanwhile, we find that selectively removing specific information from hidden states by a low-rank filter effectively steers LMs toward the intended task. Building on these findings, by measuring the hidden states on carefully designed metrics, we observe that few-shot ICL effectively simulates such task-oriented information removal processes, selectively removing the redundant information from entangled non-selective representations, and improving the output based on the demonstrations, which constitutes a key mechanism underlying ICL. Moreover, we identify essential attention heads inducing the removal operation, termed Denoising Heads, which enables the ablation experiments blocking the information removal operation from the inference, where the ICL accuracy significantly degrades, especially when the correct label is absent from the few-shot demonstrations, confirming both the critical role of the information removal mechanism and denoising heads.

Poster

P4-#4007

Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

Lorenz Hufe ⋅ Constantin Venhoff ⋅ Erblina Purelku ⋅ Maximilian Dreyer ⋅ Sebastian Lapuschkin ⋅ Wojciech Samek

Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce Dyslexify - a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit, consisting of attention heads. Without requiring finetuning, dyslexify improves performance by up to 22.06\% on a typographic variant of ImageNet-100, while reducing standard ImageNet-100 accuracy by less than 1\%, and demonstrate its utility in a medical foundation model for skin lesion diagnosis. Notably, our training-free approach remains competitive with current state-of-the-art typographic defenses that rely on finetuning. To this end, we release a family of dyslexic CLIP models which are significantly more robust against typographic attacks. These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.

Poster

P4-#4008

f-INE: A Hypothesis Testing Framework for Estimating Influence under Training Randomness

Subhodip Panda ⋅ Dhruv Tarsadiya ⋅ Shashwat Sourav ⋅ Prathosh AP ⋅ Sai Karimireddy

Influence estimation methods promise to explain and debug machine learning by estimating the impact of individual samples on the final model. Yet, existing methods collapse under training randomness: the same example may appear critical in one run and irrelevant in the next. Such instability undermines their use in data curation or cleanup since it is unclear if we indeed deleted/kept the correct datapoints. To overcome this, we introduce f-influence -- a new influence estimation framework grounded in hypothesis testing that explicitly accounts for training randomness, and establish desirable properties that make it suitable for reliable influence estimation. We also design a highly efficient algorithm f-INfluence Estimation (f-INE) that computes f-influence in a in a single training run. Finally, we scale up f-INE to estimate influence of instruction tuning data on Llama 3.1 8B and show it can reliably detect poisoned samples that steer model opinions, demonstrating its utility for data cleanup and attributing model behavior.

Poster

P4-#4009

Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning

Xinting Huang ⋅ Michael Hahn

Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these "natural" subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to "variables" used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing scalability to 2B models by finding separate subspaces mediating context and parametric knowledge routing. Viewed more broadly, our findings offer a new perspective on understanding model internals and building circuits.

Poster

P4-#4010

Tackling the XAI Disagreement Problem with Adaptive Feature Grouping

Gabriel Laberge ⋅ Ola Ahmad

Post-hoc explanations aim at understanding which input features (or groups thereof) are the most impactful toward certain model decisions. Many such methods have been proposed (ArchAttribute, Occlusion, SHAP, RISE, LIME, Integrated Gradient) and it is hard for practitioners to understand the differences between them. Even worse, faithfulness metrics, often used to quantitatively compare explanation methods, also exhibit inconsistencies. To address these issues, recent work has unified explanation methods through the lens of Functional Decomposition. We extend such work to scenarios where input features are partitioned into groups (e.g. pixel patches) and prove that disagreements between explanation methods and faithfulness metrics are caused by between-group interactions. Crucially, getting rid of between-group interactions leads to a single explanation that is optimal according to all faithfulness metrics. We finally show how to reduce the disagreements by grouping features on tabular/image data.

Poster

P4-#4011

Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions

Yuntai Bao ⋅ Xuhong Zhang ⋅ Jintao Chen ⋅ Ge Su ⋅ Yuxiang Cai ⋅ Hao Peng ⋅ SUN Bing ⋅ Haiqin Weng ⋅ Liu Yan ⋅ Jianwei Yin

Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weak-supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi-directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering.

Poster

P4-#4012

Counterfactual Explanations on Robust Perceptual Geodesics

Eslam Zaher ⋅ Maciej Trzaskowski ⋅ Quan Nguyen ⋅ Fred Roosta

Latent-space optimization methods for counterfactual explanations—framed as minimal semantic perturbations that change model predictions—inherit the ambiguity of Wachter et al.’s objective: the choice of distance metric dictates whether perturbations are meaningful or adversarial. Existing approaches adopt flat or misaligned geometries, leading to off-manifold artifacts, semantic drift, or adversarial collapse. We introduce Perceptual Counterfactual Geodesics (PCG), a method that constructs counterfactuals by tracing geodesics under a perceptually Riemannian metric induced from robust vision features. This geometry aligns with human perception and penalizes brittle directions, enabling smooth, on-manifold, semantically valid transitions. Experiments on three vision datasets show that PCG outperforms baselines and reveals failure modes hidden under standard metrics.

Poster

P4-#4013

Language Models Use Lookbacks to Track Beliefs

Nikhil Prakash ⋅ Natalie Shapira ⋅ Arnab Sen Sharma ⋅ Christoph Riedl ⋅ Yonatan Belinkov ⋅ Tamar Shaham ⋅ David Bau ⋅ Atticus Geiger

How do language models (LMs) represent characters’ beliefs, especially when those beliefs may differ from reality? This question lies at the heart of understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs' ability to reason about characters’ beliefs using causal mediation and abstraction. We construct a dataset, CausalToM, consisting of simple stories where two characters independently change the state of two objects, potentially unaware of each other's actions. Our investigation uncovered a pervasive algorithmic pattern that we call a lookback mechanism, which enables the LM to recall important information when it becomes necessary. The LM binds each character-object-state triple together by co-locating their reference information, represented as Ordering IDs (OIs), in low-rank subspaces of the state token's residual stream. When asked about a character's beliefs regarding the state of an object, the binding lookback retrieves the correct state OI and then the answer lookback retrieves the corresponding state token. When we introduce text specifying that one character is (not) visible to the other, we find that the LM first generates a visibility ID encoding the relation between the observing and the observed character OIs. In a visibility lookback, this ID is used to retrieve information about the observed character and update the observing character's beliefs. Our work provides insights into belief tracking mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.

Poster

P4-#4014

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models

Zherui Li ⋅ Zheng Nie ⋅ Zhenhong Zhou ⋅ Yue Liu ⋅ Yitong Zhang ⋅ Yu Cheng ⋅ Qingsong Wen ⋅ Kun Wang ⋅ Yufei Guo ⋅ Jiaheng Zhang

The rapid advancement of Diffusion Large Language Models (dLLMs) introduces unprecedented vulnerabilities that are fundamentally distinct from Autoregressive LLMs, stemming from their iterative and parallel generation mechanisms. In this paper, we conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final output. These findings also indicate that while current decoding strategies constitute a significant vulnerability, dLLMs possess a substantial intrinsic safety potential. To unlock this potential, we propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach: Stochastic Annealing Remasking dynamically introduces controlled randomness to mitigate greedy selection bias, while Block-level Audit and Repair exploits internal model representations for autonomous risk detection and guided correction. Comprehensive experiments on four dLLMs demonstrate DiffuGuard's exceptional effectiveness, reducing Attack Success Rate against six diverse jailbreak methods from 47.9% to 14.7% while preserving model utility and efficiency.

Poster

P4-#4015

ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks

Zhaorun Chen ⋅ Xun Liu ⋅ Mintong Kang ⋅ Jiawei Zhang ⋅ Minzhou Pan ⋅ Shuang Yang ⋅ Bo Li

As vision-language models (VLMs) gain prominence, their multimodal interfaces also introduce new safety vulnerabilities, making the safety evaluation challenging and critical. Existing red-teaming efforts are either restricted to a narrow set of adversarial patterns or depend heavily on manual engineering, lacking scalable exploration of emerging real-world adversarial strategies. To bridge this gap, we propose ARMs, an adaptive red-teaming agent that systematically conducts comprehensive risk assessments for VLMs. Given a target harmful behavior or risk definition, ARMs automatically optimizes diverse red-teaming strategies with reasoning-enhanced multi-step orchestration, to effectively elicit harmful outputs from target VLMs. This is the first red teaming framework that provides controllable generation given risk definitions. We propose 11 novel multimodal attack strategies, covering diverse adversarial patterns of VLMs (e.g., reasoning hijacking, contextual cloaking), and integrate 17 red-teaming algorithms with ARMs. To balance the diversity and effectiveness of the attack, we design a layered memory with an epsilon-greedy attack algorithm. Extensive experiments on different instance-based benchmarks and policy-based safety evaluations show that ARMs achieves the state-of-the-art attack success rate (ASR), improving ASR by an average of 52.1% compared to existing baselines and even exceeding 90% ASR on Claude-4-Sonnet, a constitutionally-aligned model widely recognized for its robustness. We show that the diversity of red-teaming instances generated by ARMs is significantly higher, revealing emerging vulnerabilities in VLMs. Leveraging ARMs, we construct ARMs-Bench, a large-scale multimodal safety benchmark comprising 30K red-teaming instances spanning 51 diverse risk categories, grounded in both real-world multimodal threats and regulatory risks. Fine-tuning with ARMs-Bench substantially reduces ASR while preserving general utility of VLMs, providing actionable insights to improve multimodal safety alignment.

Poster

P4-#4016

Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs

Jackson Kaunismaa ⋅ John Hughes ⋅ Christina Knight ⋅ Avery Griffin ⋅ Mrinank Sharma ⋅ Erik Jones

Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through \textit{elicitation attacks}. Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40\% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem level risks with output-level safeguards.

Poster

P4-#4017

Adaptive Logit Adjustment for Debiasing Multimodal Language Models

Hoin Jung ⋅ Junyi Chai ⋅ Xiaoqian Wang

Vision-Language Models (VLMs) and Large Multimodal Models (LMMs) have significantly advanced image-to-text generation tasks such as image captioning and visual question answering (VQA). However, these models often exhibit biases, including attribute misalignment between the generated text and the input image, or the reinforcement of harmful stereotypes. Existing debiasing techniques primarily focus on modifying representations at the encoder or decoder level, which can degrade model performance and may be susceptible to bias reintroduction from external sources. In this work, we propose Adaptive Logit Adjustment (ALA) for Bias Alignment and Neutralization, a post-hoc debiasing method that operates directly on logits during autoregressive text generation. Unlike prior approaches that modify internal representations, ALA selectively adjusts token probabilities to mitigate biases without distorting essential model outputs. Our approach leverages external classifiers to measure bias misalignment between image and text, applies gradient-based importance analysis to identify bias-inducing tokens, and dynamically refines token probabilities to reduce undesired biases. We evaluate ALA on image captioning and various VQA tasks, demonstrating its effectiveness in mitigating bias while maintaining contextual accuracy. Notably, our approach is applicable to various multimodal architectures in a model-agnostic manner, including VLMs and LMMs, across different tasks that involve autoregressive text generation. Our results show that logit-based debiasing offers a flexible and efficient alternative to existing encoder- and embedding-centric approaches, providing a more practical solution for building fairer multimodal AI systems.

Poster

P4-#4018

Preference Leakage: A Contamination Problem in LLM-as-a-judge

Dawei Li ⋅ Renliang Sun ⋅ Yue Huang ⋅ Ming Zhong ⋅ Bohan Jiang ⋅ Jiawei Han ⋅ Xiangliang Zhang ⋅ Wei Wang ⋅ huan liu

Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive and real-world problem that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: \url{https://github.com/David-Li0406/Preference-Leakage}.

Poster

P4-#4118

When AI Agents Collude Online: Financial Fraud Risks by Collaborative LLM Agents on Social Platforms

Qibing Ren ⋅ Zhijie Zheng ⋅ Jiaxuan Guo ⋅ Junchi Yan ⋅ Lizhuang Ma ⋅ Jing Shao

In this work, we study the risks of collective financial fraud in large-scale multi-agent systems powered by large language model (LLM) agents. We investigate whether agents can collaborate in fraudulent behaviors, how such collaboration amplifies risks, and what factors influence fraud success. To support this research, we present MultiAgentFinancialFraudBench, a large-scale benchmark for simulating financial fraud scenarios based on realistic online interactions. The benchmark covers 28 typical online fraud scenarios, spanning the full fraud lifecycle across both public and private domains. We further analyze key factors affecting fraud success, including interaction depth, activity level, and fine-grained collaboration failure modes. Finally, we propose a series of mitigation strategies including adding content-level warnings to fraudulent posts and dialogues, using LLMs as monitors to block potentially malicious agents, and fostering group resilience through information sharing at the societal level. Notably, we observe that malicious agents can adapt to environmental interventions. Our findings highlight the real-world risks of multi-agent financial fraud and suggest practical measures for mitigating them. Code is available at https://github.com/zheng977/MutiAgent4Fraud.

Poster

P4-#3413

How Dark Patterns Manipulate Web Agents

Phil Cuvin ⋅ Hao Zhu ⋅ Diyi Yang

Deceptive UI designs, widely instantiated across the web and commonly known as dark patterns, manipulate users into performing actions misaligned with their goals. In this paper, we show that dark patterns are highly effective in steer- ing agent trajectories, posing a significant risk to agent robustness. To quan- tify this risk, we introduce DECEPTICON, an environment for testing individual dark patterns in isolation. DECEPTICON includes 700 web navigation tasks with dark patterns—600 generated tasks and 100 real-world tasks, designed to measure instruction-following success and dark pattern effectiveness. Across state-of-the- art agents, we find dark patterns successfully steer agent trajectories towards mali- cious outcomes in over 70% of tested generated and real-world tasks—compared to a human average of 31%. Moreover, we find that dark pattern effectiveness correlates positively with model size and test-time reasoning, making larger, more capable models more susceptible. Leading countermeasures against adversarial attacks, including in-context prompting and guardrail models, fail to consistently reduce the success rate of dark pattern interventions. Our findings reveal dark pat- terns as a latent and unmitigated risk to web agents, highlighting the urgent need for robust defenses against manipulative designs.

Poster

P4-#4117

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

Ayoung Lee ⋅ Ryan Kwon ⋅ Peter Railton ⋅ Lu Wang

Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes, including understanding of decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in the perspectives of characters. By benchmarking 14 non-thinking and thinking models, we uncover several key findings. (1) Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions, achieving only 24.06 and 51.01 accuracy. (2) Although LLMs reasonably predict psychological discomfort, they do not adequately comprehend perspectives involving value shifts. (3) Cognitive behaviors that are effective in the math-solving and game strategy domains do not transfer to value reasoning. Instead, new failure patterns emerge, including early commitment and overcommitment. (4) The steerability of LLMs towards a given value is significantly correlated with their value preferences. (5) Finally, LLMs exhibit greater steerability when reasoning from a third-party perspective, although certain values (e.g., safety) benefit uniquely from first-person framing.

Poster

P4-#4116

Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

Liangsheng Liu ⋅ Si Chen ⋅ Jiamin Wu ⋅ Weiwei Feng ⋅ Zhixin Cheng ⋅ Xiaotian Yin ⋅ Wenfei Yang ⋅ Tianzhu Zhang

Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP’s feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score–based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.

Poster

P4-#4115

A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

Wonje Jeung ⋅ Sangyeon Yoon ⋅ Yoonjun Cho ⋅ Dongjae Jeon ⋅ Sangwoo Shin ⋅ Hyesoo Hong ⋅ Albert No

Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks such as DIJA bypass response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that aligns dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at the token-level under randomized masking, A2D achieves robustness to both any-decoding-order and any-step prefilling attacks under various conditions. It also enables real-time monitoring: dLLMs may begin a response but automatically terminate if unsafe continuation emerges. On safety benchmarks, A2D consistently prevents the generation of harmful outputs, slashing DIJA success rates from over 80\% to near-zero (1.3\% on LLaDA-8B-Instruct, 0.0\% on Dream-v0-Instruct-7B), and thresholded [EOS] probabilities allow early rejection, yielding up to 19.3× faster safe termination.

Poster

P4-#4114

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

Geon-Hyeong Kim ⋅ Yu Jin Kim ⋅ Byoungjip Kim ⋅ Honglak Lee ⋅ Kyunghoon Bae ⋅ Youngsoo Jang ⋅ Moontae Lee

As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from Human Feedback (RLHF), where recent studies have shown promising progress. However, these methods often rely on auxiliary networks or multi-stage pipelines, thereby increasing complexity. In this work, we revisit the original safety alignment objective and show that, under mild assumptions, it admits a closed-form optimal policy. We further derive a provably equivalent and tractable objective, enabling direct optimization. Building on this insight, we propose SafeDPO, a lightweight method that preserves the optimal solution of the underlying safety-constrained objective while requiring only one additional hyperparameter and minimal modifications to existing preference-based training methods. SafeDPO eliminates the need for reward models, cost models, and online sampling, relying only on preference data and safety indicators. Despite its simplicity, SafeDPO achieves competitive safety–helpfulness trade-offs compared to existing safety alignment methods. Experiments on the PKU-SafeRLHF-30K benchmark demonstrate that SafeDPO substantially improves safety while maintaining competitive helpfulness. Ablation studies further show that the additional hyperparameter provides a flexible mechanism to enhance safety while preserving the theoretical optimum, and confirm that SafeDPO scales reliably to LLMs with up to 13B parameters. Overall, our results highlight that a simple, theory-driven objective can provide a lightweight yet effective solution for safety alignment in practice.

Poster

P4-#4113

No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings

Joonsung Jeon ⋅ Woo Jae Kim ⋅ Suhyeon Ha ⋅ Sooel Son ⋅ Sung-eui Yoon

Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit , a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model’s unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.

Poster

P4-#4112

Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Simon Schrodi ⋅ Elias Kempf ⋅ Fazl Barez ⋅ Thomas Brox

Language models can transfer hidden biases during distillation. For example, a teacher that "likes owls" can make its student "like owls" too, even when the training data consists only of lists of numbers. This surprising phenomenon is called subliminal learning. Subliminal learning can be expected under soft distillation, where the student is trained on the teacher's full next-token distribution. But the fact that this also occurs under hard distillation—where the student only sees sampled tokens—raises a deeper question: when and how does subliminal learning actually occur? We answer this question through controlled experiments and mechanistic analysis. Our results show that subliminal learning does not need (global) token entanglement or logit leakage. Instead, it comes down to a small set of divergence tokens—rare cases where teachers with different biases would predict different tokens. Masking out these tokens mostly removes the hidden bias transfer. Mechanistically, divergence tokens reveal that early layers are critical. Surprisingly, finetuning even a single such early layer is sufficient for subliminal learning. Finally, we find that subliminal learning is fragile. Even small changes, like prompt paraphrasings, are usually sufficient to suppress it.

Poster

P4-#4111

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs

Alexander Panfilov ⋅ Evgenii Kortukov ⋅ Kristina Nikolić ⋅ Matthias Bethge ⋅ Sebastian Lapuschkin ⋅ Wojciech Samek ⋅ Ameya Prabhu ⋅ Maksym Andriushchenko ⋅ Jonas Geiping

Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for \textit{dishonesty} as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are crafted to be subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family. We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy. Strategic dishonesty already has a practical impact on safety evaluations, as we show that dishonest responses fool \emph{all} output-based monitors used to detect jailbreaks that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a \emph{honeypot} against malicious users, which noticeably obfuscates prior jailbreak attacks. While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty. We validate probes on datasets with verifiable outcomes and by using them as steering vectors. Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.

Poster

P4-#4110

Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots

Erfan Shayegani ⋅ G M Shahariar ⋅ Sara Abdali ⋅ Lei Yu ⋅ Nael Abu-Ghazaleh ⋅ Yue Dong

Multimodal Language Models (MMLMs) typically undergo post-training alignment to prevent harmful content generation. However, these alignment stages focus primarily on the assistant role, leaving the user role unaligned, and sticking to a fixed input prompt structure of special tokens, making the model vulnerable when inputs deviate from these expectations. We introduce Role-Modality Attacks (RMA), a novel class of adversarial attacks that exploit role confusion between the user and assistant and alter the position of the image token to elicit harmful outputs. Unlike existing attacks that modify query content, RMAs manipulate the input structure without altering the query itself. We systematically evaluate these attacks across multiple Vision Language Models (VLMs) on eight distinct settings, showing that they can be composed to create stronger adversarial prompts, as also evidenced by their increased projection in the negative refusal direction in the residual stream, a property observed in prior successful attacks. Finally, for mitigation, we propose an adversarial training approach that makes the model robust against input prompt perturbations. By training the model on a range of harmful and benign prompts all perturbed with different RMA settings, the model loses its sensitivity to Role Confusion and Modality Manipulation attacks and is trained to only pay attention to the query content in the input prompt structure, effectively reducing Attack Success Rate (ASR) while preserving the model’s general utility.

Poster

P4-#4109

Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

Igor Maljkovic ⋅ Maria Rosaria Briglia ⋅ Iacopo Masi ⋅ Antonio Emanuele Cinà ⋅ Fabio Roli

Vision–Language Models (VLMs) have become essential for tasks such as image synthesis, captioning, and retrieval by aligning textual and visual information in a shared embedding space. Yet, this flexibility also makes them vulnerable to malicious prompts designed to produce unsafe content, raising critical safety concerns. Existing defenses either rely on blacklist filters, which are easily circumvented, or on heavy classifier-based systems, both of which are costly and fragile under embedding-level attacks. We address these challenges with two complementary components: Hyperbolic Prompt Espial (HyPE) and Hyperbolic Prompt Sanitization (HyPS). HyPE is a lightweight anomaly detector that leverages the structured geometry of hyperbolic space to model benign prompts and detect harmful ones as outliers. HyPS builds on this detection by applying explainable attribution methods to identify and selectively modify harmful words, neutralizing unsafe intent while preserving the original semantics of user prompts. Through extensive experiments across multiple datasets and adversarial scenarios, we prove that our framework consistently outperforms prior defenses in both detection accuracy and robustness. Together, HyPE and HyPS offer an efficient, interpretable, and resilient approach to safeguarding VLMs against malicious prompt misuse.

Poster

P4-#4108

Uncertainty Estimation via Hyperspherical Confidence Mapping

Eunseo Choi ⋅ Ho-Yeon Kim ⋅ Jaewon Lee ⋅ Taeyong jo ⋅ Myungjun lee ⋅ Heejin Ahn

Quantifying uncertainty in neural network predictions is essential for deploying models in high-stakes domains such as autonomous driving, healthcare, and manufacturing. While conventional approaches often depend on costly sampling or parametric distributional assumptions, we propose Hyperspherical Confidence Mapping (HCM), a simple yet principled framework for uncertainty estimation that is both sampling-free and distribution-free. HCM decomposes model outputs into a magnitude and a normalized direction vector constrained to lie on a unit hypersphere, enabling a novel interpretation of uncertainty as the degree of violation of a geometric constraint. Grounded in this geometric constraint formulation, our method provides deterministic and interpretable uncertainty estimates applicable to both regression and classification. We validate the effectiveness of HCM across diverse benchmarks and real-world industrial tasks, demonstrating competitive or superior performance to ensemble and evidential approaches, while significantly reducing inference cost and ensuring strong confidence–error alignment. Our results highlight the value of geometric structure in uncertainty estimation and position HCM as a versatile alternative to conventional techniques.

Poster

P4-#3310

ELEPHANT: Measuring and understanding social sycophancy in LLMs

Myra Cheng ⋅ Sunny Yu ⋅ Cinoo Lee ⋅ Pranav Khadpe ⋅ Lujain Ibrahim ⋅ Dan Jurafsky

LLMs are known to exhibit sycophancy: agreeing with and flattering users, even at the cost of correctness. Prior work measures sycophancy only as direct agreement with users' explicitly stated beliefs that can be compared to a ground truth. This fails to capture broader forms of sycophancy such as affirming a user's self-image or other implicit beliefs. To address this gap, we introduce social sycophancy, characterizing sycophancy as excessive preservation of a user’s face (their desired self-image), and present ELEPHANT, a benchmark for measuring social sycophancy in LLMs. Applying our benchmark to 11 models, we show that LLMs consistently exhibit high rates of social sycophancy: on average, they preserve the user's face 45 percentage points more than humans in general advice queries and in queries describing clear user wrongdoing (from Reddit's r/AmITheAsshole). Furthermore, when prompted with perspectives from either side of a moral conflict, LLMs affirm whichever side the user adopts in 48% of cases—telling both the at-fault party and the wronged party that they are not wrong—rather than adhering to a consistent moral or value judgment. We further show that social sycophancy is rewarded in preference datasets. We present both prompting- and steering-based mitigation strategies to reduce social sycophancy, though understanding when and how to apply them without compromising user experience remains an open question. Our work provides theoretical and empirical tools for broadly understanding and addressing LLM sycophancy.

Poster

P4-#4107

Bias Similarity Measurement: A Black-Box Audit of Fairness Across LLMs

Hyejun Jeong ⋅ Shiqing Ma ⋅ Amir Houmansadr

Large Language Models (LLMs) reproduce social biases, yet prevailing evaluations score models in isolation, obscuring how biases persist across families and releases. We introduce Bias Similarity Measurement (BSM), which treats fairness as a relational property between models, unifying scalar, distributional, behavioral, and representational signals into a single similarity space. Evaluating 30 LLMs on 1M+ prompts, we find that instruction tuning primarily enforces abstention rather than altering internal representations; small models gain little accuracy and can become less fair under forced choice; and in our evaluation setting, open-weight models can match or exceed proprietary systems. Family signatures diverge: Gemma favors refusal, LLaMA 3.1 approaches neutrality with fewer refusals, and converges toward abstention-heavy behavior overall. Counterintuitively, Gemma 3 Instruct matches GPT-4--level fairness at far lower cost, whereas Gemini’s heavy abstention suppresses utility. Beyond these findings, BSM offers an auditing workflow for procurement, regression testing, and lineage screening, and extends naturally to code and multilingual settings. Our results reframe fairness not as isolated scores but as comparative bias similarity, enabling systematic auditing of LLM ecosystems. Code is available at https://github.com/HyejunJeong/bias_llm.

Poster

P4-#4106

Beyond Linear Probes: Dynamic Safety Monitoring for Language Models

James Oldfield ⋅ Philip Torr ⋅ Ioannis Patras ⋅ Adel Bibi ⋅ Fazl Barez

Monitoring large language models' (LLMs) activations is an effective way to detect harmful requests before they lead to unsafe outputs. However, traditional safety monitors often require the same amount of compute for every query. This creates a trade-off: expensive monitors waste resources on easy inputs, while cheap ones risk missing subtle cases. We argue that safety monitors should be flexible--costs should rise only when inputs are difficult to assess, or when more compute is available. To achieve this, we introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that polynomials can be trained and evaluated progressively, term-by-term. At test-time, one can early-stop for lightweight monitoring, or use more terms for stronger guardrails when needed. TPCs provide two modes of use. First, as a safety dial: by evaluating more terms, developers and regulators can "buy" stronger guardrails from the same model. Second, as an adaptive cascade: clear cases exit early after low-order checks, and higher-order guardrails are evaluated only for ambiguous inputs, reducing overall monitoring costs. On two large-scale safety datasets (WildGuardMix and BeaverTails), for 4 models with up to 30B parameters, we show that TPCs compete with or outperform MLP-based probe baselines of the same size, all the while being more interpretable than their black-box counterparts. Our anonymous code is available at https://github.com/james-oldfield/tpc/.

Poster

P4-#4104

PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities

Zicheng Liu ⋅ Lige Huang ⋅ Jie Zhang ⋅ Dongrui Liu ⋅ Yuan Tian ⋅ Jing Shao

The increasing autonomy of Large Language Models (LLMs) necessitates a rigorous evaluation of their potential to aid in cyber offense. Existing benchmarks often lack real-world complexity and are thus unable to accurately assess LLMs' cybersecurity capabilities. To address this gap, we introduce PACEbench, a practical AI cyber-exploitation benchmark built on the principles of realistic vulnerability difficulty, environmental complexity, and cyber defenses. Specifically, PACEbench comprises four scenarios spanning single, blended, chained, and defense vulnerability exploitations. To handle these complex challenges, we propose PACEagent, a novel agent that emulates human penetration testers by supporting multi-phase reconnaissance, analysis, and exploitation. Extensive experiments with seven frontier LLMs demonstrate that current models struggle with complex cyber scenarios, and none can bypass defenses. These findings suggest that current models do not yet pose a generalized cyber offense threat. Nonetheless, our work provides a robust benchmark to guide the trustworthy development of future models.

Poster

P4-#4102

A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models

Shihab Ahamed ⋅ Udaya Sampath K Perera Miriya Thanthrige ⋅ Ranga Rodrigo ⋅ Muhammad Haris Khan

Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code is available at https://github.com/MB-Shihab-Aaqil-Ahamed/A-TPT/.

Poster

P4-#4101

PoliCon: Evaluating LLMs on Achieving Diverse Political Consensus Objectives

Zhaowei Zhang ⋅ Xiaobo Wang ⋅ Minghua Yi ⋅ Mengmeng Wang ⋅ Fengshuo Bai ⋅ Zilong Zheng ⋅ Yipeng Kang ⋅ Yaodong Yang

Achieving political consensus is crucial yet challenging for the effective functioning of social governance. However, although frontier AI systems represented by large language models (LLMs) have developed rapidly in recent years, their capabilities in this scope are still understudied. In this paper, we introduce PoliCon, a novel benchmark constructed from 2,225 high-quality deliberation records of the European Parliament over 13 years, ranging from 2009 to 2022, to evaluate the ability of LLMs to draft consensus resolutions based on divergent party positions under varying collective decision-making contexts and political requirements. Specifically, PoliCon incorporates four factors to build each task environment for finding different political consensus: specific political issues, political goals, participating parties, and power structures based on seat distribution. We also developed an evaluation framework based on social choice theory for PoliCon, which simulates the real voting outcomes of different political parties to assess whether LLM-generated resolutions meet the requirements of the predetermined political consensus. Our experimental results demonstrate that even state-of-the-art models remain undersatisfied with complex tasks like passing resolutions by a two-thirds majority and addressing security issues, while uncovering their inherent partisan biases and revealing some behaviors LLMs show to achieve the consensus, such as prioritizing the stance of the dominant party instead of uniting smaller parties, which highlights PoliCon's promise as an effective platform for studying LLMs' ability to promote political consensus. The code and dataset are released at PoliCon Website.

Poster

P4-#4201

SumRA: Parameter Efficient Fine-tuning with Singular Value Decomposition and Summed Orthogonal Basis

Chin Yuen Kwok ⋅ Yongsen Zheng ⋅ Jia Qi Yip ⋅ Kwok Yan Lam ⋅ Ensiong Chng

Parameter-efficient fine-tuning (PEFT) aims to adapt large pretrained speech models using fewer trainable parameters while maintaining performance. Low-Rank Adaptation (LoRA) achieves this by decomposing weight updates into two low-rank matrices, $A$ and $B$, such that $W'=W_0+BA$. Previous studies showed that freezing $A$ and only updating $B$ can reduce trainable parameters and achieve performance close to standard LoRA, where $A$ is initialized using the principal singular vectors of $W_0$ obtained via singular value decomposition (SVD). However, because $A$ is typically initialized with only the leading singular vectors, its representation capacity is confined to a narrow subspace of the model’s knowledge. To overcome this limitation, we propose SumRA, which initializes each row of $A$ as a sum of multiple singular vectors chosen from beyond the leading components, thereby enabling $A$ to influence a larger portion of the model’s knowledge space. Experiments on multilingual automatic speech recognition (ASR) tasks show that by adapting Whisper to five new languages from Common Voice with only 10 hours of data each, our method improves word error rate from 14.42\% to 12.41\% over LoRA baselines while using 50\% less trainable parameters.

Poster

P4-#3011

What Do Large Language Models Know About Opinions?

Erfan Jahanparast ⋅ Zhiqing Hong ⋅ Serina Chang

What large language models (LLMs) know about human opinions has important implications for aligning LLMs with human values, simulating humans with LLMs, and understanding what LLMs learn during training. While prior works have tested LLMs' knowledge of opinions via their next-token outputs, we present the first study to probe LLMs' internal knowledge of opinions, evaluating LLMs across 22 demographic groups on a wide range of topics. First, we show that LLMs' internal knowledge of opinions far exceeds what is revealed by their outputs, with a 52-66\% improvement in alignment with the human answer distribution; this improvement is competitive with fine-tuning but nearly 300$\times$ less computationally expensive. Second, we find that knowledge of opinions emerges rapidly in the middle layers of the LLM and identify the final unembeddings as the source of the discrepancy between internal knowledge and outputs. Third, using sparse autoencoders, we trace the knowledge of opinions in the LLM's residual stream back to attention heads, and we identify specific attention head features that selectively encode different demographic groups. Through steerability experiments, we show that manipulating these features causally alters the LLM's outputs, aligning them more or less closely with different groups. These findings open new avenues for building value-aligned and computationally efficient LLMs, with applications in survey research, social simulation, and human-centered AI. Our code is available at https://github.com/schang-lab/llm-opinions.

Poster

P4-#4202

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

Amir Joudaki ⋅ Giulia Lanzillotta ⋅ Mohammad Samragh ⋅ Iman Mirzadeh ⋅ Keivan Alizadeh-Vahid ⋅ Thomas Hofmann ⋅ Mehrdad Farajtabar ⋅ Fartash Faghri

Deep learning models excel in stationary settings but suffer from loss of plasticity (LoP) in non-stationary environments. While prior literature characterizes LoP through symptoms like rank collapse of representations, it often lacks a mechanistic explanation for why gradient descent fails to recover from these states. This work presents a first-principles investigation grounded in dynamical systems theory, formally defining LoP not merely as a statistical degradation, but as an entrapment of gradient dynamics within invariant sub-manifolds of the parameter space. We identify two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Crucially, our framework uncovers a fundamental tension: the very mechanisms that promote generalization in static settings, such as low-rank compression, actively steer the network into these LoP manifolds. We validate our theoretical analysis with numerical simulations and demonstrate how architectural interventions can destabilize these manifolds to restore plasticity.

Poster

P4-#4203

Activation Function Design Sustains Plasticity in Continual Learning

Lute Lillo ⋅ Nick Cheney

In independent, identically distributed (i.i.d.) training regimes, activation functions have been benchmarked extensively, and their differences often shrink once model size and optimization are tuned. In continual learning, however, the picture is different: beyond catastrophic forgetting, models can progressively lose the ability to adapt—loss of plasticity—and the role of the non-linearity in this failure mode remains underexplored. We show that activation choice is a primary, architecture-agnostic lever for mitigating plasticity loss. Building on a property-level analysis of negative-branch shape and saturation behavior, we introduce two drop-in nonlinearities—Smooth-Leaky and Randomized Smooth-Leaky—and evaluate them in two complementary settings: (i) supervised class-incremental benchmarks and (ii) reinforcement learning with non-stationary MuJoCo environments designed to induce controlled distribution and dynamics shifts. We also provide a simple stress protocol and diagnostics that link the shape of the activation to the adaptation under change. The takeaway is straightforward: thoughtful activation design offers a lightweight, domain-general way to sustain plasticity in continual learning without extra capacity or task-specific tuning.

Poster

P4-#4204

Co-LoRA: Collaborative Model Personalization on Heterogeneous Multi-Modal Clients

Minhyuk Seo ⋅ Taeheon Kim ⋅ Hankook Lee ⋅ Jonghyun Choi ⋅ Tinne Tuytelaars

As AI becomes more personal, e.g., Agentic AI, there is an increasing need for personalizing models for various use cases. Personalized federated learning (PFL) enables each client to collaboratively leverage other clients' knowledge for better adaptation to the task of interest, without privacy risks. Despite its potential, existing PFL methods remain confined to rather simplified scenarios where data and models are the same across clients. To move towards realistic scenarios, we move beyond these restrictive assumptions by addressing both data and model heterogeneity. We propose a task-relevance-aware model aggregation strategy to reduce parameter interference under heterogeneous data. Moreover, we introduce Co-LoRA, a dimension-invariant module that enables knowledge sharing across heterogeneous architectures. To mimic the real-world task diversity, we propose a multi-modal PFL benchmark spanning 40 distinct tasks with distribution shifts over time. Extensive experiments shows that our proposed method significantly outperforms the state-of-the-art PFL methods under heterogeneous scenarios. Code is available at https://github.com/snumprlab/fedmosaic.

Poster

P4-#4205

PAS: Estimating the target accuracy before domain adaptation

Raphaella Diniz ⋅ Jackson de Faria ⋅ Martin Ester

The goal of domain adaptation is to make predictions for unlabeled samples from a target domain with the help of labeled samples from a different but related source domain. The performance of domain adaptation methods is highly influenced by the choice of source domain and pre-trained feature extractor. However, the selection of source data and pre-trained model is not trivial due to the absence of a labeled validation set for the target domain and the large number of available pre-trained models. In this work, we propose PAS, a novel score designed to estimate the transferability of a source domain set and a pre-trained feature extractor to a target classification task before actually performing domain adaptation. PAS leverages the generalization power of pre-trained models and assesses source-target compatibility based on the pre-trained feature embeddings. We integrate PAS into a framework that indicates the most relevant pre-trained model and source domain among multiple candidates, thus improving target accuracy while reducing the computational overhead. Extensive experiments on image classification benchmarks demonstrate that PAS correlates strongly with actual target accuracy and consistently guides the selection of the best-performing pre-trained model and source domain for adaptation.

Poster

P4-#4206

Designing Rules to Pick a Rule: Aggregation by Consistency

Ratip Emin Berker ⋅ Ben Armstrong ⋅ Vincent Conitzer ⋅ Nihar Shah

Rank aggregation has critical applications for developing AI agents, as well as for evaluating them. However, different methods can give rise to significantly different aggregate rankings, impacting these applications. Indeed, work in social choice and statistics has produced many rank aggregation methods, each with its desirable properties, but also with its limitations. Given this trade-off, how do we decide which aggregation rule to use, i.e., what is a good rule picking rule (RPR)? In this paper, we design a data-driven RPR that identifies the best method for each dataset without assuming a generative model. The principle behind our RPR is to maximize consistency if the data collection process was repeated. We show that our method satisfies several consistency-related axioms failed by a wide class of natural RPRs. While we prove that the computational problem of maximizing consistency is hard, we provide a sampling-based implementation that is efficient in practice. We run this implementation on known statistical models to experimentally demonstrate its desirable properties, as well as on real-world data where our method provides important insights into how to improve consistency.

Poster

P4-#4207

Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence

Bingji Yi ⋅ Qiyuan Liu ⋅ Yuwei Cheng ⋅ Haifeng Xu

Synthetic data has been increasingly used to train frontier generative models. However, recent studies raise key concerns that iteratively retraining a generative model on its self-generated synthetic data may keep deteriorating model performance, a phenomenon often coined model collapse. In this paper, we investigate ways to modify the synthetic retraining process to avoid model collapse, and even possibly help reverse the trend from collapse to improvement. Our key finding is that by injecting information through an external synthetic data verifier, whether a human or a better model, synthetic retraining will not cause model collapse. Specifically, we situate our theoretical analysis in the fundamental linear regression setting, showing that verifier-guided retraining can yield near-term improvements, but ultimately drives the parameter estimate to the verifier's “knowledge center” in the long run. Our theory further predicts that, unless the verifier is perfectly reliable, these early gains will plateau and may even reverse. Indeed, our experiments across linear regression, Variational Autoencoders (VAEs) trained on MNIST, and fining-tuning SmolLM2-135M on the XSUM task confirm these theoretical insights.

Poster

P4-#4208

Overparametrization bends the landscape: BBP transitions at initialization in simple Neural Networks

Brandon Annesi ⋅ Dario Bocchi ⋅ Chiara Cammarota

High-dimensional non-convex loss landscapes play a central role in the theory of Machine Learning. Gaining insight into how these landscapes interact with gradient-based optimization methods, even in relatively simple models, can shed light on this enigmatic feature of neural networks. In this work, we will focus on a prototypical simple learning problem, which generalizes the Phase Retrieval inference problem by allowing the exploration of overparametrized settings. Using techniques from field theory, we analyze the spectrum of the Hessian at initialization and identify a Baik–Ben Arous–Péché (BBP) transition in the amount of data that separates regimes where the initialization is informative or uninformative about a planted signal of a teacher-student setup. Crucially, we demonstrate how overparameterization can "bend" the loss landscape, shifting the transition point, even reaching the information-theoretic weak-recovery threshold in the large overparameterization limit, while also altering its qualitative nature. We distinguish between continuous and discontinuous BBP transitions and support our analytical predictions with simulations, examining how they compare to the finite-N behavior. In the case of discontinuous BBP transitions strong finite-N corrections allow the retrieval of information at a signal-to-noise ratio (SNR) smaller than the predicted BBP transition. In these cases we provide estimates for a new lower SNR threshold that marks the point at which initialization becomes entirely uninformative.

Poster

P4-#4209

Good Allocations from Bad Estimates

Sílvia Casacuberta ⋅ Moritz Hardt

Conditional average treatment effect (CATE) estimation is the de facto gold standard for targeting a treatment to a heterogeneous population. The method estimates treatment effects up to an error $\epsilon > 0$ in each of $M$ different strata of the population, targeting individuals in decreasing order of estimated treatment effect until the budget runs out. In general, this method requires $O(M/\epsilon^2)$ samples. This is best possible if the goal is to estimate all treatment effects up to an $\epsilon$ error. In this work, we show how to achieve the same total treatment effect as CATE with only $O(M/\epsilon)$ samples for natural distributions of treatment effects. The key insight is that coarse estimates suffice for near-optimal treatment allocations. In addition, we show that budget flexibility can further reduce the sample complexity of allocation. Finally, we evaluate our algorithm on various real-world RCT datasets. In all cases, it finds nearly optimal treatment allocations with surprisingly few samples. Our work highlights the fundamental distinction between treatment effect estimation and treatment allocation: the latter requires far fewer samples.

Poster

P4-#4210

Why High-rank Neural Networks Generalize?: An Algebraic Framework with RKHSs

Yuka Hashimoto ⋅ Sho Sonoda ⋅ Isao Ishikawa ⋅ Masahiro Ikeda

We derive a new Rademacher complexity bound for deep neural networks using Koopman operators, group representations, and reproducing kernel Hilbert spaces (RKHSs). The proposed bound describes why the models with high-rank weight matrices generalize well. Although there are existing bounds that attempt to describe this phenomenon, these existing bounds can be applied to limited types of models. We introduce an algebraic representation of neural networks and a kernel function to construct an RKHS to derive a bound for a wider range of realistic models. This work paves the way for the Koopman-based theory for Rademacher complexity bounds to be valid for more practical situations.

Poster

P4-#4211

Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

Qingyue Zhao ⋅ Kaixuan Ji ⋅ Heyang Zhao ⋅ Tong Zhang ⋅ Quanquan Gu

Many offline reinforcement learning algorithms are underpinned by $f$-divergence regularization, but their sample complexity *defined with respect to regularized objectives* still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the $\tilde{\Theta}(\epsilon^{-1})$ sample complexity for offline $f$-divergence-regularized contextual bandits. For reverse Kullback–Leibler (KL) divergence, arguably the most commonly used one, we achieve an $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and $\tilde{O}(\epsilon^{-2})$ bound under single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL. Moreover, for $f$-divergences with strongly convex $f$, to which reverse KL *does not* belong, we show that the sharp sample complexity $\tilde{\Theta}(\epsilon^{-1})$ is achievable even without pessimistic estimation or single-policy concentrability. We further corroborate our theoretical insights with numerical experiments and extend our analysis to contextual dueling bandits. We believe these results take a significant step towards a comprehensive understanding of objectives with $f$-divergence regularization.

Poster

P4-#4212

Polynomial Convergence of Riemannian Diffusion Models

Xingyu Xu ⋅ Ziyi Zhang ⋅ Yorie Nakahira ⋅ Guannan Qu ⋅ Yuejie Chi

Diffusion generative models have demonstrated remarkable empirical success in the recent years and are now considered the state-of-the-art generative models in modern AI. These models consist of a forward process, which gradually diffuses the data distribution to a noise distribution spanning the whole space, and a backward process, which inverts this transformation to recover the data distribution from noise. Most of the existing literature assumes that the underlying space is Euclidean. However, in many practical applications, the data are constrained to lie on a submanifold of Euclidean space. Addressing this setting, de Bortoli et al. (2022) introduced Riemannian diffusion models and proved that using an exponentially small step size yields small sampling error in Wasserstein distance, provided the data distribution is smooth and strictly positive. In this paper, we prove that a polynomially small stepsize suffices to guarantee small sampling error in total variation distance, without any assumption on the smoothness or positivity of the data distribution. Our analysis only requires mild and standard curvature assumptions on the underlying manifold. The main ingredients in our proof are Li-Yau estimate for log-gradient of heat kernel, and Minakshisundaram-Pleijel parametrix expansion for perturbed heat equation. Our approach opens the door to a sharper analysis of diffusion models on non-Euclidean spaces.

Poster

P4-#4213

A Theoretical Analysis of Mamba’s Training Dynamics: Filtering Relevant Features for Generalization in State Space Models

Mugunthan Shandirasegaran ⋅ Hongkang Li ⋅ Songyang Zhang ⋅ Meng Wang ⋅ Shuai Zhang

The recent empirical success of Mamba and other selective state space models (SSMs) has renewed interest in non-attention architectures for sequence modeling, yet their theoretical foundations remain underexplored. We present a first-step analysis of generalization and learning dynamics for a simplified but representative Mamba block: a single-layer, single-head selective SSM with input-dependent gating, followed by a two-layer MLP trained via gradient descent (GD). Our study adopts a structured data model with tokens that include both class-relevant and class-irrelevant patterns under token-level noise and examines two canonical regimes: majority-voting and locality-structured data sequences. We prove that the model achieves guaranteed generalization by establishing non-asymptotic sample complexity and convergence rate bounds, which improve as the effective signal increases and the noise decreases. Furthermore, we show that the gating vector aligns with class-relevant features while ignoring irrelevant ones, thereby formalizing a feature-selection role similar to attention but realized through selective recurrence. Numerical experiments on synthetic data justify our theoretical results. Overall, our results provide principled insight into when and why Mamba-style selective SSMs learn efficiently, offering a theoretical counterpoint to Transformer-centric explanations.

Poster

P4-#4214

Why DPO is a Misspecified Estimator and How to Fix It

Aditya Gopalan ⋅ Sayak Ray Chowdhury ⋅ Debangshu Banerjee

Direct alignment algorithms such as Direct Preference Optimization (DPO) fine-tune models based on preference data, using only supervised learning instead of two-stage reinforcement learning with human feedback (RLHF). We show that DPO encodes a statistical estimation problem over reward functions induced by a parametric policy class. When the true reward function that generates preferences cannot be realized via the policy class, DPO becomes misspecified, resulting in failure modes such as preference order reversal, worsening of policy reward, and high sensitivity to the input preference data distribution. On the other hand, we study the local behavior of two-stage RLHF for a parametric class and relate it to a natural gradient step in policy space. Our fine-grained geometric characterization allows us to propose AuxDPO, which introduces additional auxiliary variables in the DPO loss function to help move towards the RLHF solution in a principled manner and mitigate the misspecification in DPO. We empirically demonstrate the superior performance of AuxDPO on didactic bandit settings as well as LLM alignment tasks.

Poster

P4-#4215

On the Spectral Differences Between NTK and CNTK and Their Implications for Point Cloud Recognition

yuanqu Mou ⋅ Chang Gou ⋅ Haiyang Bai ⋅ Jia Liu

The Convolutional Neural Tangent Kernel (CNTK) offers a principled framework for understanding convolutional architectures in the infinite-width regime. However, a comprehensive spectral comparison between CNTK and the classical Neural Tangent Kernel (NTK) remains underexplored. In this work, we present a detailed analysis of the spectral properties of CNTK and NTK, revealing that point cloud data exhibits a stronger alignment with the spectral bias of CNTK than images. This finding suggests that convolutional structures are inherently more suited to such geometric and irregular data formats. Based on this insight, we implement CNTK-based kernel regression for point cloud recognition tasks and demonstrate that it significantly outperforms NTK and other kernel baselines, especially in low-data settings. Furthermore, we derive a closed-form expression that connects CNTK with NTK in hybrid architectures. In addition, we introduce a closed-form of CNTK followed by NTK, while not the main focus, achieves strong empirical performance when applied to point-cloud tasks. Our study not only provides new theoretical understanding of spectral behaviors in neural tangent kernels but also shows that these insights can guide the practical design of CNTK-based regression for structured data such as point clouds.

Poster

P4-#4216

Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling

Jialin Liu ⋅ Lisang Ding ⋅ Stanley J Osher ⋅ Wotao Yin

Implicit models, an emerging model class, compute outputs by iterating a single parameter block to a fixed point. This architecture realizes an infinite-depth, weight-tied network that trains with constant memory, significantly reducing memory needs for the same level of performance compared to explicit models. While it is empirically known that these compact models can often match or even exceed the accuracy of larger explicit networks by allocating more test-time compute, the underlying reasons are not yet well understood. We study this gap through a non-parametric analysis of expressive power. We provide a strict mathematical characterization, showing that a simple and regular implicit operator can, through iteration, progressively express more complex mappings. We prove that for a broad class of implicit models, this process allows the model's expressive power to grow with test-time compute, ultimately matching a much richer function class. The theory is validated across four domains: imaging, scientific computing, operations research, and LLM reasoning, demonstrating that as test-time iterations increase, the complexity of the learned mapping rises, while the solution quality simultaneously improves and stabilizes.

Poster

P4-#4217

Understanding the Dynamics of Forgetting and Generalization in Continual Learning via the Neural Tangent Kernel

Guodong Zheng ⋅ Peng Wang ⋅ Shengchao Hu ⋅ Quan Zheng ⋅ Li Shen

Continual learning (CL) enables models to acquire new tasks sequentially while retaining previously learned knowledge. However, most theoretical analyses focus on simplified, converged models or restrictive data distributions and therefore fail to capture how forgetting and generalization evolve during training in more general settings. Current theory faces two fundamental challenges: (i) analyses confined to the converged regime cannot characterize intermediate training dynamics; and (ii) establishing forgetting bounds requires two-sided bounds on the population risk for each task. To address these challenges, we analyze the training-time dynamics of forgetting and generalization in standard CL within the Neural Tangent Kernel (NTK) regime, showing that decreasing the loss’s Lipschitz constant and minimizing the cross-task kernel jointly reduce forgetting and improve generalization. Specifically, we (i) characterize intermediate training stages via kernel gradient flow and (ii) employ Rademacher complexity to derive both upper and lower bounds on population risk. Building on these insights, we propose \emph{OGD+}, which projects the current task’s gradient onto the orthogonal complement of the subspace spanned by gradients of the most recent task evaluated on all prior samples. We further introduce \emph{Orthogonal Penalized Gradient Descent} (OPGD), which augments OGD+ with gradient-norm penalization to jointly reduce forgetting and enhance generalization. Experiments on multiple benchmarks corroborate our theoretical predictions and demonstrate the effectiveness of OPGD, providing a principled pathway from theory to algorithm design in CL.

Poster

P4-#4218

Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers

Peter Shaw ⋅ James Cohan ⋅ Jacob Eisenstein ⋅ Kristina Toutanova

The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning. However, its application to neural networks such as Transformers is challenging due to the lack of a principled, universal measure for model complexity. This paper introduces the theoretical notion of asymptotically optimal description length objectives, grounded in the theory of Kolmogorov complexity. We establish that a minimizer of such an objective achieves optimal compression, for any dataset, up to an additive constant, in the limit as model resource bounds increase. We prove that asymptotically optimal objectives exist for Transformers, building on a new demonstration of their computational universality. We further show that such objectives can be tractable and differentiable by constructing and analyzing a variational objective based on an adaptive Gaussian mixture prior. Our empirical analysis shows that this variational objective selects for a low-complexity solution with strong generalization on an algorithmic task, but standard optimizers fail to find such solutions from a random initialization, highlighting key optimization challenges. More broadly, by providing a theoretical framework for identifying description length objectives with strong asymptotic guarantees, we outline a potential path towards training neural networks that achieve greater compression and generalization.

Poster

P4-#4318

InfoNCE Induces Gaussian Distribution

Roy Betser ⋅ Eyal Gofer ⋅ Meir Yossef Levi ⋅ Guy Gilboa

Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for commonly observed Gaussianity in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.

Poster

P4-#4317

Tight Bounds for Schrodinger Potential Estimation in Unpaired Data Translation

Nikita Puchkin ⋅ Denis Suchkov ⋅ Aleksei Naumov ⋅ Denis Belomestny

Modern methods of generative modelling and unpaired data translation based on Schrodinger bridges and stochastic optimal control theory aim to transform an initial density to a target one in an optimal way. In the present paper, we assume that we only have access to i.i.d. samples from the initial and final distributions. This makes our setup suitable for both generative modelling and unpaired data translation. Relying on the stochastic optimal control approach, we choose an Ornstein-Uhlenbeck process as the reference one and estimate the corresponding Schrodinger potential. Introducing a risk function as the Kullback-Leibler divergence between couplings, we derive tight bounds on the generalization ability of an empirical risk minimizer over a class of Schrodinger potentials, including Gaussian mixtures. Thanks to the mixing properties of the Ornstein-Uhlenbeck process, we almost achieve fast rates of convergence, up to some logarithmic factors, in favourable scenarios. We also illustrate the performance of the suggested approach with numerical experiments.

Poster

P4-#4316

Sublinear Spectral Clustering Oracle with Little Memory

Ranran Shen ⋅ Xiaoyi Zhu ⋅ Pan Peng ⋅ Zengfeng Huang

We study the problem of designing *sublinear spectral clustering oracles* for well-clusterable graphs. Such an oracle is an algorithm that, given query access to the adjacency list of a graph $G$, first constructs a compact data structure $\mathcal{D}$ that captures the clustering structure of $G$. Once built, $\mathcal{D}$ enables sublinear time responses to \textsc{WhichCluster}$(G,x)$ queries for any vertex $x$. A major limitation of existing oracles is that constructing $\mathcal{D}$ requires $\Omega(\sqrt{n})$ memory, which becomes a bottleneck for massive graphs and memory-limited settings. In this paper, we break this barrier and establish a memory-time trade-off for sublinear spectral clustering oracles. Specifically, for well-clusterable graphs, we present oracles that construct $\mathcal{D}$ using much smaller than $O(\sqrt{n})$ memory (e.g., $O(n^{0.01})$) while still answering membership queries in sublinear time. We also characterize the trade-off frontier between memory usage $S$ and query time $T$, showing, for example, that $S\cdot T=\widetilde{O}(n)$ for clusterable graphs with a logarithmic conductance gap, and we show that this trade-off is nearly optimal (up to logarithmic factors) for a natural class of approaches. Finally, to complement our theory, we validate the performance of our oracles through experiments on synthetic networks.

Journal Track Poster

P4-#4315

PCF Learned Sort: a Learning Augmented Sort Algorithm with O(nloglogn) Expected Complexity

Atsuki Sato · Yusuke Matsui

Sorting is one of the most fundamental algorithms in computer science. Recently, Learned Sorts, which use machine learning to improve sorting speed, have attracted attention. While existing studies show that Learned Sort is empirically faster than classical sorting algorithms, they do not provide theoretical guarantees about its computational complexity. We propose Piecewise Constant Function (PCF) Learned Sort, a theoretically guaranteed Learned Sort algorithm. We prove that the expected complexity of PCF Learned Sort is $\mathcal{O}(n \log \log n)$ under mild assumptions on the data distribution. We also confirm empirically that PCF Learned Sort has a computational complexity of $\mathcal{O}(n \log \log n)$ on both synthetic and real datasets. This is the first study to theoretically support the empirical success of Learned Sort, and provides evidence for why Learned Sort is fast.

Poster

P4-#4314

Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective

Zhihao Zhang ⋅ Qiaole Dong ⋅ Qi Zhang ⋅ Enyu Zhou ⋅ Jun Zhao ⋅ Zhiheng Xi ⋅ Senjie Jin ⋅ Xiaoran Fan ⋅ Yuhao Zhou ⋅ Mingqi Wu ⋅ Yanwei Fu ⋅ Tao Ji ⋅ Tao Gui ⋅ Xuanjing Huang ⋅ Kai Chen

Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt (multimodal) large language models to downstream tasks. While effective at task adaptation, their impact on retaining prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on the open-source Qwen2.5-VL series. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but better maintains prior knowledge. We study this phenomenon through learning dynamics by examining both the magnitude and direction of how training data influence prior knowledge. Our analysis shows that RFT mainly reinforces correct samples naturally aligned with the base model’s probability landscape, leading to weaker interference with prior knowledge. Moreover, training on RFT-simulated rollouts, which exert a smaller magnitude of influence and are better aligned in direction to prior knowledge, allows SFT to preserve prior knowledge better while rapidly learning new tasks. We further validate our framework on Qwen2.5 post-training in math and scientific QA, observing consistent forgetting and learning-dynamics trends. These findings suggest that the distribution of post-training data, rather than algorithmic differences alone, plays a central role in forgetting, and highlight RFT as a promising ingredient for stable continual post-training.

Poster

P4-#4313

Matched Data, Better Models: Target Aligned Data Filtering with Sparse Autoencoders

Arnav Das ⋅ Gantavya Bhatt ⋅ Sahil Verma ⋅ Yiping Wang ⋅ Viswa Virinchi Muppirala ⋅ Jeff Bilmes

Data filtering plays a central role in improving model performance, particularly for vision language models that are pretrained on large, noisy, and redundant image-caption datasets. Existing filtering techniques assess every sample individually and retain those that exceed a certain quality threshold, but such strategies fail to capture higher-order interactions. In this work, we propose a novel submodular framework for data selection that addresses this limitation. Our method, Submodular Distribution Matching (SDM), selects a subset by: (1) training a type of sparse autoencoder to learn disentangled and \emph{monotone} features; (2) estimating a target feature distribution from a target dataset; and (3) selecting a subset of samples whose feature distribution closely matches the target via submodular maximization. Given the DataComp-medium training set and no external models, SDM achieves state-of-the-art accuracy on both ImageNet-1K and average performance across 38 downstream tasks. On the full DataComp-medium benchmark, SDM delivers performance within 1\% of the state-of-the-art results while using over \textbf{\emph{5×}} fewer GPU hours than the leading approach.

Poster

P4-#4312

Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Mohammad Taufeeque ⋅ Aaron Tucker ⋅ Adam Gleave ⋅ Adrià Garriga-Alonso

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels, extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to propagate the negative value in reverse, thus pruning the last few steps and letting an alternative plan emerge; a form of backtracking. Our work shows that, a precise understanding of the plan representation allows us to directly understand the bidirectional planning-like algorithm learned by model-free training in more familiar terms.

Poster

P4-#4311

Learning multimodal dictionary decompositions with group-sparse autoencoders

Chiraag Kaushik ⋅ Davis Barch ⋅ Andrea Fanelli

The Linear Representation Hypothesis asserts that the embeddings learned by neural networks can be understood as linear combinations of features corresponding to high-level concepts. Based on this ansatz, sparse autoencoders (SAEs) have recently become a popular method for decomposing embeddings into a sparse combination of linear directions, which have been shown empirically to often correspond to human-interpretable semantics. However, recent attempts to apply SAEs to multimodal embedding spaces (such as the popular CLIP embeddings for image/text data) have found that SAEs often learn ``split dictionaries,'' where most of the learned sparse features are essentially unimodal, active only for data of a single modality. In this work, we study how to effectively adapt SAEs for the setting of multimodal embeddings while ensuring multimodal alignment. We first argue that the existence of a split dictionary decomposition on an aligned embedding space implies the existence of a non-split dictionary with improved modality alignment. Then, we propose a new SAE-based approach to multimodal embedding decomposition using cross-modal random masking and group-sparse regularization. We apply our method to popular embeddings for image/text (CLIP) and audio/text (CLAP) data and show that, compared to standard SAEs, our approach learns a more multimodal dictionary while reducing the number of dead neurons and improving feature semanticity. We finally demonstrate how this improvement in alignment of concepts between modalities can enable improvements in the interpretability and control of cross-modal tasks.

Poster

P4-#4310

On the Expressiveness of State Space Models via Temporal Logics

Eric Alsmann ⋅ Lowejatan Noori ⋅ Martin Lange

We investigate the expressive power of state space models (SSM), which have recently emerged as a potential alternative to transformer architectures in large language models. Building on recent work, we analyse SSM expressiveness through fragments and extensions of linear temporal logic over finite traces. Our results show that the expressive capabilities of SSM vary substantially depending on the underlying gating mechanism. We further distinguish between SSM operating over fixed-width arithmetic (quantised models), whose expressive power remains within regular languages, and SSM with unbounded precision, which can capture counting properties and non-regular languages. In addition, we provide a systematic comparison between these different SSM variants and known results on transformers, thereby clarifying how the two architectures relate in terms of expressive power.

Poster

P4-#4309

Two (narrow) heads are better than (an arbitrarily wide) one

Amanuel Tesfaye ⋅ Zeno Kujawa ⋅ Rajmohan Rajaraman ⋅ Ravi Sundaram

In this paper, we establish a dimension- and precision-independent impossibility result for a simplified transformer model. Due to their size, a comprehensive understanding of the internal operations of frontier large language models (LLMs) is beyond the reach of current methods, but research into small and interpretable models has proven successful. We study the representational limits of attention, the core of transformer models, through the lens of the Endpoint Selection Problem (ESP), a simple yet expressive learning task defined over arcs of a directed graph. Our main theoretical results are twofold: (i) 1-head, 1-layer, attention-only transformers cannot solve ESP on any graph containing a cycle, even with unbounded dimension and precision; but, all DAGs (Directed Acyclic Graph) are solvable with zero error (ii) in contrast, a 2-head, 1-layer, attention-only transformer can solve ESP on arbitrary directed graphs with constant embedding dimension and logarithmic precision. Prior lower bounds were conditional on bounds on dimension and precision. Through a transformation, we extend our impossibility result from ESP to the much studied 2-hop induction head problem. Further, we uncover a surprising connection to NP-completeness by showing that the optimal error of the 1-head transformer is exactly related to the size of MAS (Maximum Acyclic Subgraph) and hence inapproximable. Finally, we validate our theory with experiments and observe that gradient-based optimization can reliably find 1-head solutions for DAGs and 2-head solutions for arbitrary graphs with cycles, whereas 1-head models struggle to reach the optimal solution in graphs with cycles. We believe that our techniques are of independent interest and have the potential to establish a new fine-grained hierarchy of transformer architectures, each with greater problem-solving power than the last.

Poster

P4-#4308

Efficient Estimation of Kernel Surrogate Models for Task Attribution

Zhenshuo Zhang ⋅ Minxuan Duan ⋅ Hongyang Zhang

Modern AI agents such as large language models are trained on diverse tasks---translation, code generation, mathematical reasoning, and text prediction---simultaneously. A key question is to quantify how each individual training task influences performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict a target task's performance for any subset of training tasks has emerged in recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships, but miss nonlinear interactions such as synergy, antagonism, or XOR-type effects. In this paper, we first consider a unified task weighting framework for analyzing task attribution methods, and show a new connection between linear surrogate models and influence functions through a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate surrogate estimates with less than 2% relative error without repeated retraining. Experiments across multiple domains---including mathematical reasoning in transformers, in-context learning, and multi-objective reinforcement learning---demonstrate the effectiveness of kernel surrogate models. They achieve a 25% higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines, enabling more accurate and scalable task attribution. When used for downstream task selection, kernel surrogate models further yield a 40% improvement in demonstration selection for in-context learning and multi-objective reinforcement learning benchmarks.

Poster

P4-#4307

Influence Dynamics and Stagewise Data Attribution

Jin Hwa Lee ⋅ Matthew Smith ⋅ Maxwell Adam ⋅ Jesse Hoogland

Current training data attribution (TDA) methods treat the influence one sample has on another as static, but neural networks learn in distinct stages that exhibit changing patterns of influence. In this work, we introduce a framework for stagewise data attribution grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first validate these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model's progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.

Poster

P4-#4306

The Deleuzian Representation Hypothesis

Clément Cornet ⋅ Romaric Besançon ⋅ Hervé Le Borgne

We propose an alternative to sparse autoencoders (SAEs) as a simple and effective unsupervised method for extracting interpretable concepts from neural networks. The core idea is to cluster differences in activations, which we formally justify within a discriminant analysis framework. To enhance the diversity of extracted concepts, we refine the approach by weighting the clustering using the skewness of activations. The method aligns with Deleuze's modern view of concepts as differences. We evaluate the approach across five models and three modalities (vision, language, and audio), measuring concept quality, diversity, and consistency. Our results show that the proposed method achieves concept quality surpassing prior unsupervised SAE variants while approaching supervised baselines, and that the extracted concepts enable steering of a model’s inner representations, demonstrating their causal influence on downstream behavior.

Poster

P4-#4305

Strong Correlations Induce Cause Only Predictions in Transformer Training

Haihan Zhang ⋅ Yimu Zhang ⋅ Cong Fang

We revisit when Transformers can prioritize causes over spurious effects by viewing the problem through data correlation strength and the implicit regularization of gradient descent. We identify a phenomenon called Correlation Crowding-Out (CCO) arising from the training dynamics of Transformers. Specifically, under strongly correlated causal features, gradient descent filters out spurious cues and converges to a predictor that relies almost exclusively on the causes. Theoretically, using a simplified Transformer model trained on data from a minimal causal chain, we introduce a Dominant-coordinate condition that characterizes when CCO arises and explain its mechanism as a coupling of ''occupation'' and ''crowding-out''. ''Occupation'' denotes to rapid growth of weights aligned with the dominant causal direction while non-dominant directions remain small. ''Crowding-out'' denotes to attention logits align with separation directions favoring the causal branch, suppressing descendants. We provide convergence guarantees for both the optimization trajectory and generalization. Our empirical results on simulated and real examples across various tasks including vision and natural language demonstrate the procedure. Together, these results show that, under suitable conditions, standard training alone can induce cause only prediction.

Poster

P4-#4304

DADA: Dual Averaging with Distance Adaptation

Mohammad Moshtaghifar ⋅ Anton Rodomanov ⋅ Daniil Vankov ⋅ Sebastian Stich

We present a novel parameter-free universal gradient method for solving convex optimization problems. Our algorithm—Dual Averaging with Distance Adaptation (DADA)–is based on the classical scheme of dual averaging and dynamically adjusts its coefficients based on the observed gradients and the distance between its iterates to the starting point, without the need for knowing any problem-specific parameters. DADA is a universal algorithm that simultaneously works for a wide range of problem classes as long as one is able to bound the local growth of the objective around its minimizer. Particular examples of such problem classes are nonsmooth Lipschitz functions, Lipschitz-smooth functions, Hölder-smooth functions, functions with high-order Lipschitz derivative, quasi-self-concordant functions, and (L0, L1)-smooth functions. Furthermore, in contrast to many existing methods, DADA is suitable not only for unconstrained problems, but also constrained ones, possibly with unbounded domain, and it does not require fixing neither the number of iterations nor the accuracy in advance.

Poster

P4-#4303

Decision-Theoretic Approaches for Improved Learning-Augmented Algorithms

Spyros Angelopoulos ⋅ Christoph Dürr ⋅ Georgii Melidi

We initiate the systematic study of decision-theoretic metrics in the design and analysis of algorithms with machine-learned predictions. We introduce approaches based on both deterministic measures such as distance-based evaluation, that help us quantify how close the algorithm is to an ideal solution, and stochastic measures that balance the trade-off between the algorithm's performance and the risk associated with the imperfect oracle. These approaches allow us to quantify the algorithm's performance across the full spectrum of the prediction error, and thus choose the best algorithm within an entire class of otherwise incomparable ones. We apply our framework to three well-known problems from online decision making, namely ski-rental, one-max search, and contract scheduling.

Poster

P4-#4302

Beyond Short Steps in Frank-Wolfe Algorithms

David Martinez-Rubio ⋅ Sebastian Pokutta

We introduce novel techniques to enhance Frank-Wolfe algorithms by leveraging function smoothness beyond traditional short steps. Our study focuses on Frank-Wolfe algorithms with step sizes that incorporate primal-dual guarantees, offering practical stopping criteria. We present a new Frank-Wolfe algorithm utilizing an optimistic framework and provide a primal-dual convergence proof. Additionally, we propose a generalized short-step strategy aimed at optimizing a computable primal-dual gap. Interestingly, this new generalized short-step strategy is also applicable to gradient descent algorithms beyond Frank-Wolfe methods. Empirical results demonstrate that our optimistic algorithm outperforms existing methods, highlighting its practical advantages.

Poster

P4-#4301

Single-Loop Byzantine-Resilient Federated Bilevel Optimization

Yangnan Li ⋅ Shenghui Song ⋅ Xuanyu Cao

Federated bilevel optimization plays a crucial role in solving complex problems with nested optimization structures. However, its distributed nature makes it highly susceptible to faulty or Byzantine behaviors. Existing Byzantine-resilient approaches are either restricted to simple single-level optimization problems or rely on sub-loop updates that introduce significant computational and communication overhead. To address these limitations, we propose a family of Byzantine-resilient federated bilevel algorithms, which (i) operate within a single-loop structure, (ii) achieve optimal Byzantine resilience, and (iii) ensure computational and communication efficiency. The core of the proposed method, BR-FedBi, leverages an auxiliary variable that facilitates efficient hypergradient estimation while simultaneously solving the lower- and upper-level problems. Building on BR-FedBi, we further integrate the algorithm with Polyak’s momentum and the probabilistic gradient estimator (PAGE) (Li et al., 2021), resulting in provable optimal Byzantine resilience and optimal sample complexity. Both theoretical analysis and empirical results demonstrate the superior performance of the proposed algorithms.

Poster

P4-#4401

Optimizing Data Augmentation through Bayesian Model Selection

Madi Matymov ⋅ Ba-Hien Tran ⋅ Michael Kampffmeyer ⋅ Markus Heinonen ⋅ Maurizio Filippone

Data Augmentation (DA) has become an essential tool to improve robustness and generalization of modern machine learning. However, when deciding on DA strategies it is critical to choose parameters carefully, and this can be a daunting task which is traditionally left to trial-and-error or expensive optimization based on validation performance. In this paper, we counter these limitations by proposing a novel framework for optimizing DA. In particular, we take a probabilistic view of DA, which leads to the interpretation of augmentation parameters as model (hyper)-parameters, and the optimization of the marginal likelihood with respect to these parameters as a Bayesian model selection problem. Due to its intractability, we derive a tractable ELBO, which allows us to optimize augmentation parameters jointly with model parameters. We provide extensive theoretical results on variational approximation quality, generalization guarantees, invariance properties, and connections to empirical Bayes. Through experiments on computer vision and NLP tasks, we show that our approach improves calibration and yields robust performance over fixed or no augmentation. Our work provides a rigorous foundation for optimizing DA through Bayesian principles with significant potential for robust machine learning.

Poster

P4-#4402

Action Chunking and Data Augmentation Yield Exponential Improvements in Behavior Cloning for Continuous Spaces

Thomas T. Zhang ⋅ Daniel Pfrommer ⋅ Chaoyi Pan ⋅ Nikolai Matni ⋅ Max Simchowitz

This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.

Poster

P4-#4404

Queue Length Regret Bounds for Contextual Queueing Bandits

Seoungbin Bae ⋅ Garyeong Kang ⋅ Dabeen Lee

We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-$\varepsilon$, achieves a regret upper bound of $\widetilde{\mathcal{O}}(T^{-1/4})$. We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of $\mathcal{O}(\log^2 T)$. Lastly, we provide experimental results that validate our theoretical findings.

Poster

P4-#4405

Analysis of approximate linear programming solution to Markov decision problem with log barrier function

Donghwan Lee ⋅ Hyukjun Yang ⋅ Bumgeun Park

There are two primary approaches to solving Markov decision problems (MDPs): dynamic programming based on the Bellman equation and linear programming (LP). Dynamic programming methods are the most widely used and form the foundation of both classical and modern reinforcement learning (RL). By contrast, LP-based methods have been less commonly employed, although they have recently gained attention in contexts such as offline RL. The relative underuse of the LP-based methods stems from the fact that it leads to an inequality-constrained optimization problem, which is generally more challenging to solve effectively compared with Bellman-equation-based methods. The purpose of this paper is to establish a theoretical foundation for solving LP-based MDPs in a more effective and practical manner. Our key idea is to leverage the log-barrier function, widely used in inequality-constrained optimization, to transform the LP formulation of the MDP into an unconstrained optimization problem. This reformulation enables approximate solutions to be obtained easily via gradient descent. While the method may appear naive, to the best of our knowledge, a thorough theoretical interpretation of this approach has not yet been developed. This paper aims to bridge this gap.

Poster

P4-#4406

Replicable Reinforcement Learning with Linear Function Approximation

ERIC EATON ⋅ Marcel Hussing ⋅ Michael Kearns ⋅ Aaron Roth ⋅ Sikata Sengupta ⋅ Jessica Sorrell

Replication of experimental results has been a challenge faced by many scientific disciplines, including the field of machine learning. Recent work on the theory of machine learning has formalized replicability as the demand that an algorithm produce identical outcomes when executed twice on different samples from the same distribution. Provably replicable algorithms are especially interesting for reinforcement learning (RL), where algorithms are known to be unstable in practice. While replicable algorithms exist for tabular RL settings, extending these guarantees to more practical function approximation settings has remained an open problem. In this work, we make progress by developing replicable methods for linear function approximation in RL. We first introduce two efficient algorithms for replicable random design regression and uncentered covariance estimation, each of independent interest. We then leverage these tools to provide the first provably efficient replicable RL algorithms for linear Markov decision processes in both the generative model and episodic settings. Finally, we evaluate our algorithms experimentally and show how they can inspire more consistent neural policies.

Poster

P4-#4408

On the Reasoning Abilities of Masked Diffusion Language Models

Anej Svete ⋅ Ashish Sabharwal

Masked diffusion models (MDMs) for text offer a compelling alternative to traditional autoregressive language models. Parallel generation makes them efficient, but their computational capabilities and the limitations inherent in their parallelism remain largely unexplored. To this end, we characterize what types of reasoning problems MDMs can provably solve and how efficiently. We do this by connecting MDMs to the well-understood reasoning frameworks of chain of thought (CoT) and padded looped transformers (PLTs) in the finite-precision log-width setting: We show that MDMs and polynomially-padded PLTs are, in fact, equivalent in this setting, and that MDMs can solve all problems that CoT-augmented transformers can. Moreover, we showcase classes of problems (including regular languages) for which MDMs are inherently more efficient than CoT transformers, where parallel generation allows for substantially faster reasoning.

Poster

P4-#4409

The Expressive Limits of Diagonal SSMs for State-Tracking

Mehran Shakerinava ⋅ Behnoush Khavari ⋅ Siamak Ravanbakhsh ⋅ Sarath Chandar

State-Space Models (SSMs) have recently been shown to achieve strong empirical performance on a variety of long-range sequence modeling tasks while remaining efficient and highly-parallelizable. However, the theoretical understanding of their expressive power remains limited. In this work, we study the expressivity of input-Dependent Complex-valued Diagonal (DCD) SSMs on sequential state-tracking tasks. We show that single-layer DCD SSMs cannot express state-tracking of any non-Abelian group at finite precision. More generally, we show that $k$-layer DCD SSMs can express state-tracking of a group if and only if that group has a subnormal series of length $k$, with Abelian factors. That is, we identify the precise expressivity range of $k$-layer DCD SSMs within the solvable groups. Empirically, we find that multi-layer models often fail to learn state-tracking for non-Abelian groups, highlighting a gap between expressivity and learnability.

Blog Track Poster

P4-#4410

Defining and quantifying compositional structure

Eric Elmoznino ⋅ Guillaume Lajoie

Compositionality is thought to be crucial in human cognition and AI, but we lack a scientific understanding of what it is. What kind of data is compositionally structured? Can we mathematically quantify the amount and character of compositional structure? This blog post introduces a novel approach for doing so, building off of existing tools from algorithmic information theory that formalize notions of complexity and structure. The mathematical definition of compositionality that we'll come to is rigorous, precise, and general, and the hope is that it can inspire novel research directions in AI for uncovering compositional structure in natural data.

Poster

P4-#4411

Beyond Pairwise: Empowering LLM Alignment With (Ranked) Choice Modeling

Yuxuan Tang ⋅ Yifan Feng

Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiway comparisons and top-$k$ rankings. We introduce \textit{Ranked Choice Preference Optimization} (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. RCPO supports both utility-based and rank-based models, subsumes several pairwise methods (such as DPO and SimPO) as special cases, and provides principled training objectives for richer feedback formats. We instantiate this framework with two representative models (Multinomial Logit and Mallows-RMJ). Experiments on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct across in-distribution and out-of-distribution settings show that RCPO consistently outperforms competitive baselines. RCPO shows that directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. It offers an extensible foundation for incorporating (ranked) choice modeling into LLM training.

Poster

P4-#4412

Trajectory Generation with Conservative Value Guidance for Offline Reinforcement Learning

Tieru Wang ⋅ Kunbao Wu ⋅ Guoshun Nan

Recent advances in offline reinforcement learning (RL) have led to the development of high-performing algorithms that achieve impressive results across standard benchmarks. However, many of these methods depend on increasingly complex planning architectures, which hinder their deployment in real-world settings due to high inference costs. To overcome this limitation, recent research has explored data augmentation techniques that offload computation from online decision-making to offline data preparation. Among these, diffusion-based generative models have shown potential in synthesizing diverse trajectories but incur significant overhead in training and data generation. In this work, we propose Trajectory Generation with Conservative Value Guidance (TGCVG), a novel trajectory-level data augmentation framework that integrates a high-performing offline policy with a learned dynamics model. To ensure that the synthesized trajectories are both high-quality and close to the original dataset distribution, we introduce a value-guided regularization during the training of the offline policy. This regularization encourages conservative action selection, effectively mitigating distributional shift during trajectory synthesis. Empirical results on standard benchmarks demonstrate that TGCVG not only improves the performance of state-of-the-art offline RL algorithms but also significantly reduces training and trajectory synthesis time. These findings highlight the effectiveness of value-aware data generation in improving both efficiency and policy performance.

Poster

P4-#4413

In-Context Compositional Q-Learning for Offline Reinforcement Learning

Qiushui Xu ⋅ Yuhao Huang ⋅ Yushu Jiang ⋅ Wenliang Zheng ⋅ Lei Song ⋅ Jinyu Wang ⋅ Jiang Bian

Accurate estimation of the Q-function is a central challenge in offline reinforcement learning. However, existing approaches often rely on a shared global Q-function, which is inadequate for capturing the compositional structure of tasks that consist of diverse subtasks. We propose In-context Compositional Q-Learning (ICQL), an offline RL framework that formulates Q-learning as a contextual inference problem and uses linear Transformers to adaptively infer local Q-functions from retrieved transitions without explicit subtask labels. Theoretically, we show that, under two assumptions---linear approximability of the local Q-function and accurate inference of weights from retrieved context---ICQL achieves a bounded approximation error for the Q-function and enables near-optimal policy extraction. Empirically, ICQL substantially improves performance in offline settings, achieving gains of up to 16.4\% on kitchen tasks and up to 8.8\% and 6.3\% on MuJoCo and Adroit tasks, respectively. These results highlight the underexplored potential of in-context learning for robust and compositional value estimation and establish ICQL as a principled and effective framework for offline RL.

Poster

P4-#4414

Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning

Jiayu Chen ⋅ Le Xu ⋅ Wen-Tse Chen ⋅ Jeff Schneider

Offline reinforcement learning (RL) is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based reinforcement learning (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving the data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, there could be various MDPs that behave identically on the offline dataset and dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further propose a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our "RL + Search" framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation input. The proposed algorithm significantly outperforms state-of-the-art offline RL methods on twelve D4RL MuJoCo tasks and three challenging, stochastic tokamak control tasks.

Poster

P4-#4415

Koopman-Assisted Trajectory Synthesis: A Data Augmentation Framework for Offline Imitation Learning

Jin Wang ⋅ Pengcheng He ⋅ Ke Jiang ⋅ Xiaoyang Tan

Data augmentation plays a pivotal role in offline imitation learning (IL) by alleviating covariate shift, yet existing methods remain constrained. Single-step techniques frequently violate underlying system dynamics, whereas trajectory-level approaches are plagued by compounding errors or scalability limitations. Even recent Koopman-based methods typically function at the single-step level, encountering computational bottlenecks due to action-equivariance requirements and vulnerability to approximation errors. To overcome these challenges, we introduce Koopman-Assisted Trajectory Synthesis (KATS), a novel framework for generating complete, multi-step trajectories. By operating at the trajectory level, KATS effectively mitigates compounding errors. It leverages a state-equivariant assumption to ensure computational efficiency and scalability, while incorporating a refined generator matrix to bolster robustness against Koopman approximation errors. This approach enables a more direct and efficacious mechanism for distribution matching in offline IL. Extensive experiments demonstrate that KATS substantially enhances policy performance and achieves state-of-the-art (SOTA) results, especially in demanding scenarios with narrow expert data distributions.

Poster

P4-#4416

Masked Skill Token Training for Hierarchical Off-Dynamics Transfer

Zeyu Feng ⋅ Haiyan Yin ⋅ Yew-Soon Ong ⋅ Harold Soh

Generalizing policies across environments with altered dynamics remains a key challenge in reinforcement learning, particularly in offline settings where direct interaction or fine-tuning is impractical. We introduce Masked Skill Token Training (MSTT), a fully offline hierarchical RL framework that enables policy transfer using observation-only demonstrations. MSTT constructs a discrete skill space via unsupervised trajectory tokenization and trains a skill-conditioned value function using masked Bellman updates, which simulate dynamics shifts by selectively disabling skills. A diffusion-based trajectory generator, paired with feasibility-based filtering, enables the agent to execute valid, temporally extended actions without requiring action labels or access to the target environment. Our results in both discrete and continuous domains demonstrate the potential of mask-guided planning for robust generalization under dynamics shifts. To our knowledge, MSTT is the first work to explore masking as a mechanism for simulating and generalizing across off-dynamics environments. It marks a promising step toward scalable, structure-aware transfer and opens avenues to explore multi-goal conditioning, and extensions to more complex, real-world scenarios.

Poster

P4-#4417

MOBODY: Model-Based Off-Dynamics Offline Reinforcement Learning

Yihong Guo ⋅ Yu Yang ⋅ Pan Xu ⋅ Anqi Liu

We study off-dynamics offline reinforcement learning, where the goal is to learn a policy from offline source and limited target datasets with mismatched dynamics. Existing methods either penalize the reward or discard source transitions occurring in parts of the transition space with high dynamics shift. As a result, they optimize the policy using data from low-shift regions, limiting exploration of high-reward states in the target domain that do not fall within these regions. Consequently, such methods often fail when the dynamics shift is significant or the optimal trajectories lie outside the low-shift regions. To overcome this limitation, we propose MOBODY, a Model-Based Off-Dynamics Offline RL algorithm that optimizes a policy using learned target dynamics transitions to explore the target domain, rather than only being trained with the low dynamics-shift transitions. For the dynamics learning, built on the observation that achieving the same next state requires taking different actions in different domains, MOBODY employs separate action encoders for each domain to encode different actions to the shared latent space while sharing a unified representation of states and a common transition function. We further introduce a target Q-weighted behavior cloning loss in policy optimization to avoid out-of-distribution actions, which push the policy toward actions with high target-domain Q-values, rather than high source domain Q-values or uniformly imitating all actions in the offline dataset. We evaluate MOBODY on a wide range of MuJoCo and Adroit benchmarks, demonstrating that it outperforms state-of-the-art off-dynamics RL baselines as well as policy learning methods based on different dynamics learning baselines, with especially pronounced improvements in challenging scenarios where existing methods struggle.

Poster

P4-#4418

Multiplayer Nash Preference Optimization

Fang Wu ⋅ Xu Huang ⋅ Weihao Xuan ⋅ Zhiwei Zhang ⋅ Yijia Xiao ⋅ Frank Wan ⋅ Xiaomin Li ⋅ Bing Hu ⋅ Peng Xia ⋅ Jure Leskovec ⋅ Yejin Choi

Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models with human preferences. However, reward-based methods grounded in the Bradley–Terry assumption struggle to capture the nontransitivity and heterogeneity of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO that offer strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, introducing a single-opponent bias that fails to capture the full complexity of realistic preference structures. This work introduces Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Comprehensive empirical evaluation shows that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at~\url{https://github.com/smiles724/MNPO}.

Poster

P4-#4517

On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning

Yifan Zhang ⋅ Yifeng Liu ⋅ Rina Hughes ⋅ Yang Yuan ⋅ Quanquan Gu ⋅ Andrew Yao

Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). KL regularization is ubiquitous, yet the design surface, choice of KL direction (forward vs. reverse), normalization (normalized vs. unnormalized), and estimator ($k_1/k_2/k_3$), is scattered across the literature and often intertwined with off-policy estimation. We ask a focused question: under the off-policy setting, what weighting is required for each KL variant so that the surrogate we optimize yields the exact gradient of the intended KL-regularized objective? We answer this with a compact, unified derivation we call the Regularized Policy Gradient (\textbf{RPG}) view. RPG (i) unifies normalized and unnormalized KL variants and shows that the widely-used $k_3$ penalty is exactly the unnormalized KL; (ii) specifies conditions under which REINFORCE-style losses with stop-gradient are gradient-equivalent to fully differentiable surrogates; (iii) identifies and corrects an off-policy importance-weighting mismatch in GRPO's KL term; and (iv) introduces RPG-Style Clip, a truncated-importance-sampling step within RPG-REINFORCE that enables stable, off-policy policy-gradient training at scale. On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO. Notably, RPG is a \emph{stable and scalable} RL algorithm for LLM reasoning, realized via (a) a KL-correct objective, (b) truncated importance sampling, and (c) an iterative reference-policy update scheme.

Poster

P4-#4516

Training Large Reasoning Models Efficiently via Progressive Thought Encoding

Zeliang Zhang ⋅ Xiaodong Liu ⋅ Hao Cheng ⋅ Hao Sun ⋅ Chenliang Xu ⋅ Jianfeng Gao

Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and memory usage. While sliding-window cache strategies can bound memory, they disrupt long-context reasoning and degrade performance. We introduce Progressive Thought Encoding, a parameter-efficient fine-tuning method that enables LRMs to reason effectively under fixed-size caches. By progressively encoding intermediate reasoning into compact representations, our approach eliminates the need to backpropagate through full-cache rollouts, thereby reducing training-time memory usage, while maintaining constant memory during inference. Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, across six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3\% improvement over LoRA and +29.9\% over the baseline on average, with up to +23.4 absolute gains on AIME2024/2025 under tight cache budgets. These results demonstrate that Progressive Thought Encoding not only improves reasoning accuracy but also makes RL training of LRMs substantially more efficient and scalable under real-world memory constraints.

Poster

P4-#4515

Simplicial Embeddings Improve Sample Efficiency in Actor–Critic Agents

Johan S Obando Ceron ⋅ Walter Mayor ⋅ Samuel Lavoie ⋅ Scott Fujimoto ⋅ Aaron Courville ⋅ Pablo Samuel Castro

Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.

Poster

P4-#4514

Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

Luckeciano Carvalho Melo ⋅ Alessandro Abate ⋅ Yarin Gal

Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains understudied. As a result, existing implementations often resort to conservative hyperparameter choices to ensure stability, which requires more training samples and increases computational costs. Hence, developing models for reliably tracking the underlying optimization dynamics and leveraging them into training enables more sample-efficient regimes and further unleashes scalable post-training. We address this gap by formalizing the stochastic optimization problem of policy gradients with explicit consideration of second-order geometry. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. We further employ this framework to design interventions in the optimization process through data selection. The resultant algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out. Theoretically, we establish monotonic improvement guarantees under realistic assumptions. On standard math reasoning benchmarks, we empirically show that CAPO ensures stable updates under aggressive learning regimes where baselines catastrophically fail. With minimal intervention (rejecting fewer than 8% of tokens), CAPO achieves up to 30$\times$ improvement in sample efficiency over standard GRPO for LLM reasoning.

Poster

P4-#4513

APPLE: Toward General Active Perception via Reinforcement Learning

Tim Schneider ⋅ Cristiana de Farias ⋅ Roberto Calandra ⋅ Liming Chen ⋅ Jan Peters

Active perception is a fundamental skill that enables us humans to deal with uncertainty in our inherently partially observable environment. For senses such as touch, where the information is sparse and local, active perception becomes crucial. In recent years, active perception has emerged as an important research domain in robotics. However, current methods are often bound to specific tasks or make strong assumptions, which limit their generality. To address this gap, this work introduces APPLE (Active Perception Policy Learning) – a novel framework that leverages reinforcement learning (RL) to address a range of different active perception problems. APPLE jointly trains a transformer-based perception module and decision-making policy with a unified optimization objective, learning how to actively gather information. By design, APPLE is not limited to a specific task and can, in principle, be applied to a wide range of active perception problems. We evaluate two variants of APPLE across different tasks, including tactile exploration problems from the Tactile MNIST benchmark. Experiments demonstrate the efficacy of APPLE, achieving high accuracies on both regression and classification tasks. These findings underscore the potential of APPLE as a versatile and general framework for advancing active perception in robotics. Project page: https://timschneider42.github.io/apple

Poster

P4-#4512

QuRL: Low-Precision Reinforcement Learning for Efficient Reasoning

Yuhang Li ⋅ Reena Elangovan ⋅ Xin Dong ⋅ Priyadarshini Panda ⋅ Brucek Khailany

Reinforcement learning with verifiable rewards (RLVR) has become a trending paradigm for training reasoning large language models (LLMs). However, due to the autoregressive decoding nature of LLMs, the rollout process becomes the efficiency bottleneck of RL training, consisting of up to 70\% of the total training time. In this work, we propose Quantized Reinforcement Learning (QuRL) that uses a quantized actor for accelerating the rollout. We address two challenges in QuRL. First, we propose Adaptive Clipping Range (ACR) that dynamically adjusts the clipping ratio based on the policy ratio between the full-precision actor and the quantized actor, which is essential for mitigating long-term training collapse. Second, we identify the weight update problem, where weight changes between RL steps are extremely small, making it difficult for the quantization operation to capture them effectively. We mitigate this problem through the invariant scaling technique that reduces quantization noise and increases weight update. We evaluate our method with INT8 and FP8 quantization experiments on DeepScaleR and DAPO, and achieve 20% to 80% faster rollout during training.

Poster

P4-#4511

Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Yuanfu Wang ⋅ Zhixuan Liu ⋅ Xiangtian Li ⋅ Chaochao Lu ⋅ yang chao

The dominant paradigm for training large reasoning models—combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)—is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a vast landscape of unverifiable tasks unaddressed. To overcome these limitations, we introduce Native Reasoning Training (NRT), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-correcting feedback loop where the model learns to \textit{think} in ways that resolve its own uncertainty. Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.

Poster

P4-#4510

Representation-Based Exploration for Language Models: From Test-Time to Post-Training

Jens Tuyls ⋅ Dylan Foster ⋅ Akshay Krishnamurthy ⋅ Jordan Ash

Reinforcement learning (RL) promises to expand the capabilities of language models, but it is unclear if current RL techniques promote the discovery of novel behaviors, or simply sharpen those already present in the base model. In this paper, we investigate the value of deliberate exploration---explicitly incentivizing the model to discover novel and diverse behaviors---and aim to understand how the knowledge in pre-trained models can guide this search. Our main finding is that exploration with a simple, principled, representation-based bonus derived from the pre-trained language model's hidden states significantly improves diversity and pass@k rates---both for post-training, and in a novel inference-time scaling setting we introduce. (1) For inference-time, exploration with representation-based diversity improves efficiency, consistently improving pass@k rates across a variety of models and reasoning tasks. For example, for Qwen-2.5-14b-Instruct we obtain over 50\% improvement in verifier efficiency on almost all considered tasks. (2) For post-training, we show that integrating this exploration strategy into an RL pipeline improves reasoning performance over that of the initial model and over standard RL post-training. For example, on AIME 2024, our post-trained Qwen-2.5-7b-Instruct's pass@80 matches the pass@256 of GRPO on the same model, demonstrating a 3x improvement in test-time sample efficiency. Overall, our findings suggest that deliberate exploration---with the right notion of diversity---is a practical path toward discovery of new behaviors beyond sharpening.

Poster

P4-#4509

Reinforcement Learning via Value Gradient Flow

Haoran Xu ⋅ Kaiwen Hu ⋅ Somayeh Sojoudi ⋅ Amy Zhang

We study behavior-regularized reinforcement learning (RL), where regularization toward a reference distribution (the dataset in offline RL or the base model in LLM RL finetuning) is essential to prevent value over-optimization caused by erroneous out-of-distribution extrapolation. Existing methods either rely on reparameterized policy gradient, which are difficult to scale to large generative models, or on reject sampling, which can be overly conservative when attempting to move beyond the behavior support. In this paper, we propose Value Gradient Flow (VGF), a scalable new paradigm for behavior-regularized RL. VGF casts behavior-regularized RL as an optimal transport problem that maps the reference distribution to the value-induced optimal policy distribution. We solve this transport problem via discrete gradient flow, where value gradients guide particles initialized from the reference distribution. Our analysis shows that VGF imposes regularization implicitly by controlling the transport budget. VGF eliminates explicit policy parameterization while remaining expressive and flexible, this enables adaptive test-time scaling by adjusting the transport budget. Extensive experiments demonstrate that VGF significantly outperforms prior methods, achieving state-of-the-art results on offline RL benchmarks (D4RL, OGBench) and LLM RL tasks.

Poster

P4-#4508

A Reward-Free Viewpoint on Multi-Objective Reinforcement Learning

Ying-Tu Chen ⋅ Wei Hung ⋅ Bing-Shu Wu ⋅ Zhang-Wei Hong ⋅ Ping-Chun Hsieh

Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach addresses this by training a single policy network conditioned on preference-weighted rewards. In this paper, we explore a novel algorithmic perspective: leveraging reward-free reinforcement learning (RFRL) for MORL. While RFRL has historically been studied independently of MORL, it learns optimal policies for any possible reward function, making it a natural fit for MORL's challenge of handling unknown user preferences. We propose using the RFRL's training objective as an auxiliary task to enhance MORL, enabling more effective knowledge sharing beyond the multi-objective reward function given at training time. To this end, we adapt a state-of-the-art RFRL algorithm to the MORL setting and introduce a preference-guided exploration strategy that focuses learning on relevant parts of the environment. Through extensive experiments and ablation studies, we demonstrate that our approach significantly outperforms the state-of-the-art MORL methods across diverse MO-Gymnasium tasks, achieving superior performance and data efficiency. This work provides the first systematic adaptation of RFRL to MORL, demonstrating its potential as a scalable and empirically effective solution to multi-objective policy learning.

Poster

P4-#4507

ContextIF: Enhancing Instruction-Following through Context Reward

Yule Zhong ⋅ Jiacheng Yao ⋅ Guoxiu He

While supervised fine-tuning (SFT) and preference learning (PL) are widely used to enhance the instruction-following ability of Large Language Models (LLMs), they often struggle to generalize to novel or complex instructions and may compromise the models' general capabilities. In-Context Learning (ICL) emerges as a promising alternative due to its strong generalization without modifying the model's parameters, but its effectiveness is constrained by the reliance on high-quality, manually curated demonstration pools. To overcome this limitation, we propose ContextIF, a reinforcement learning (RL) framework for automatic context generation. Guided by comprehensive context reward, ContextIF is optimized by Group Relative Policy Optimization (GRPO). It aims to generate precise constraint summaries and optimal context demonstrations tailored to given instructions, thereby improving the instruction-following performance of target LLMs. We evaluate ContextIF on multiple representative instruction-following benchmarks using popular open-source LLMs. Experimental results demonstrate that ContextIF achieves substantial performance gains over existing SFT and ICL methods, while also generalizing effectively to unseen constraint conditions. Moreover, ContextIF preserves the parameters and general capabilities of the target models, offering strong adaptability and scalability. Our code is available at https://github.com/ECNU-Text-Computing/ContextIF.

Poster

P4-#4506

Reference Grounded Skill Discovery

Seungeun (Ross) Rho ⋅ Aaron Trinh ⋅ Danfei Xu ⋅ Sehoon Ha

Scaling unsupervised skill discovery algorithms to high-DoF agents remains challenging. As dimensionality increases, the exploration space grows exponentially, while the manifold of meaningful skills remains limited. Therefore, semantic meaningfulness becomes essential to effectively guide exploration in high-dimensional spaces. In this work, we present **Reference-Grounded Skill Discovery (RGSD)**, a novel algorithm that grounds skill discovery in a semantically meaningful latent space using reference data. RGSD first performs contrastive pretraining to embed motions on a unit hypersphere, clustering each reference trajectory into a distinct direction. This grounding enables skill discovery to simultaneously involve both imitation of reference behaviors and the discovery of semantically related diverse behaviors. On a simulated SMPL humanoid with $359$-D observations and $69$-D actions, RGSD successfully imitates skills such as walking, running, punching, and sidestepping, while also discover variations of these behaviors. In downstream locomotion tasks, RGSD leverages the discovered skills to faithfully satisfy user-specified style commands and outperforms imitation-learning baselines, which often fail to maintain the commanded style.

Poster

P4-#3404

Demystifying The Mechanisms Behind Emergent Exploration in Goal-Conditioned RL

Mahsa Bastankhah ⋅ Grace Liu ⋅ Dilip Arumugam ⋅ Thomas L. Griffiths ⋅ Benjamin Eysenbach

In this work, we take a first step toward elucidating the mechanisms behind emergent exploration in unsupervised reinforcement learning. We study Single-Goal Contrastive Reinforcement Learning (SGCRL) (Liu et al., 2025), a self-supervised algorithm capable of solving challenging long-horizon goal-reaching tasks without external rewards or curricula. We combine theoretical analysis of the algorithm’s objective function with controlled experiments to understand what drives its exploration. We show that SGCRL maximizes implicit rewards shaped by its learned representations. These representations automatically modify the reward landscape to promote exploration before reaching the goal and exploitation thereafter. Our experiments also demonstrate that these exploration dynamics arise from learning low-rank representations of the state space rather than from neural network function approximation. Our improved understanding enables us to adapt SGCRL to perform safety-aware exploration.

Poster

P4-#4505

The Art of Scaling Reinforcement Learning Compute for LLMs

Devvrit Khatri ⋅ Lovish Madaan ⋅ Rishabh Tiwari ⋅ Rachit Bansal ⋅ Venkata Sai Surya Subramanyam Duvvuri ⋅ Manzil Zaheer ⋅ Inderjit Dhillon ⋅ David Brandfonbrener ⋅ Rishabh Agarwal

Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.

Poster

P4-#4504

RL Grokking Recipe: How Does RL Unlock and Transfer New Algorithms in LLMs?

Yiyou Sun ⋅ Yuhan Cao ⋅ Pohao Huang ⋅ Haoyue Bai ⋅ Hanna Hajishirzi ⋅ Nouha Dziri ⋅ Dawn Song

It remains an open question whether LLMs can acquire or generalize genuinely new reasoning strategies, beyond the sharpened skills encoded in their parameters during pre-training or post-training. To attempt to answer this debate, we introduce DELTA — Distributional Evaluation of Learnability and Transferrability in Algorithmic Coding, a controlled benchmark of synthetic coding problem families designed to probe two fundamental aspects: learnability—can LLMs, through reinforcement learning (RL), solve problem families where pretrained models exhibit failure with large enough attempts (pass@K=0)?—and transferability—if learnability happens, can such skills transfer systematically to out-of-distribution (OOD) test sets? Unlike prior public coding datasets, DELTA isolates reasoning skills through templated problem generators and introduces fully OOD problem families that demand novel strategies rather than tool invocation or memorized patterns. Our experiments reveal a striking grokking phase transition: after an extended period with near-zero reward, RL-trained models abruptly climb to near-perfect accuracy. To enable learnability on previously unsolvable problem families, we explore key training ingredients such as staged warm-up with dense rewards, experience replay, curriculum training, and verification-in-the-loop. Beyond learnability, we use DELTA to evaluate transferability or generalization along exploratory, compositional, and transformative axes, as well as cross-family transfer. Results show solid gains within families and for recomposed skills, but persistent weaknesses in transformative cases. DELTA thus offers a clean testbed for probing the limits of RL-driven reasoning and for understanding how models can move beyond existing priors to acquire new algorithmic skills.

Poster

P4-#4503

Octax: Accelerated CHIP-8 Arcade Environments for Reinforcement Learning in JAX

Waris Radji ⋅ Thomas Michel ⋅ Hector Piteau

Reinforcement learning (RL) research requires diverse, challenging environments that are both tractable and scalable. While modern video games may offer rich dynamics, they are computationally expensive and poorly suited for large-scale experimentation due to their CPU-bound execution. We introduce Octax, a high-performance suite of classic arcade game environments implemented in JAX, based on CHIP-8 emulation, a predecessor to Atari, which is widely adopted as a benchmark in RL research. Octax provides the JAX community with a long-awaited end-to-end GPU alternative to Atari games, offering image-based environments, spanning puzzle, action, and strategy genres, all executable at massive scale on modern GPUs. Our JAX-based implementation achieves orders-of-magnitude speedups over traditional CPU emulators. We demonstrate Octax's capabilities by training RL agents across multiple games, showing significant improvements in training speed and scalability compared to existing solutions. The environment's modular design enables researchers to easily extend the suite with new games or generate novel environments using large language models, making it an ideal platform for large-scale RL experimentation. Our open-source framework is available at https://github.com/riiswa/octax/.

Poster

P4-#4502

Jackpot: Align Actor-Policy Distribution for scalable and stable RL for LLM

Zhuoming Chen ⋅ Hongyi Liu ⋅ Yang Zhou ⋅ Haizhong Zheng ⋅ Beidi Chen

Reinforcement learning (RL) has become an increasingly important paradigm for improving large language models (LLMs) on alignment, reasoning, and coding tasks, yet it remains extremely costly. The majority of training time is spent on rollouts. Allowing actor and policy distributions to differ could unlock substantial scalability and efficiency benefits, such as supporting large-batch or asynchronous training, and even enabling a lightweight rollout model. However, existing importance sampling–based corrections for distribution mismatch suffer from an inherent trade-off between stability and training performance. To tackle this problem, we propose Jackpot, which leverages Optimal Budget Rejection Sampling to directly reduce the gap between actor and policy distributions. For efficiency and stability in practical training, We introduce an efficient probability estimation strategy based on Top-$K$ logits with batch bias correction, and designs a stabilized Jackpot-PPO loss that jointly accounts for both the importance sampling ratio and the trust-region constraint in PPO. Empirically, our method achieves stable improvements in large-batch and asynchronous training, and in extreme off-policy training it substantially delays the onset of collapse and delivers competitive performance. Specifically, we achieve 20\% improvement on AMC benchmarks and ~8\% AIME benchmarks over the off-policy baseline under 128$\times$ actor-policy update ratio for Qwen3-4B-Base and 64$\times$ for Qwen3-8B-Base, while achieving greater stability and better performance than prior off-policy RL methods under extreme settings.

Poster

P4-#4501

Bridging the performance-gap between target-free and target-based reinforcement learning

Théo Vincent ⋅ Yogesh Tripathi ⋅ Tim Faust ⋅ Abdullah Akgül ⋅ Yaniv Oren ⋅ Melih Kandemir ⋅ Jan Peters ⋅ Carlo D Eramo

The use of target networks in deep reinforcement learning is a widely popular solution to mitigate the brittleness of semi-gradient approaches and stabilize learning. However, target networks notoriously require additional memory and delay the propagation of Bellman updates compared to an ideal target-free approach. In this work, we step out of the binary choice between target-free and target-based algorithms. We introduce a new method that uses a copy of the last linear layer of the online network as a target network, while sharing the remaining parameters with the up-to-date online network. This simple modification enables us to keep the target-free's low-memory footprint while leveraging the target-based literature. We find that combining our approach with the concept of iterated $Q$-learning, which consists of learning consecutive Bellman updates in parallel, helps improve the sample-efficiency of target-free approaches. Our proposed method, iterated Shared $Q$-Learning (iS-QL), bridges the performance gap between target-free and target-based approaches across various problems while using a single $Q$-network, thus stepping towards resource-efficient reinforcement learning algorithms.

Poster

P4-#4601

Policy Newton Algorithm in Reproducing Kernel Hilbert Space

Yixian Zhang ⋅ Huaze Tang ⋅ Changxu Wei ⋅ Chao Wang ⋅ Wenbo Ding

Reinforcement learning (RL) policies represented in Reproducing Kernel Hilbert Spaces (RKHS) offer powerful representational capabilities. While second-order optimization methods like Newton's method demonstrate faster convergence than first-order approaches, current RKHS-based policy optimization remains constrained to first-order techniques. This limitation stems primarily from the intractability of explicitly computing and inverting the infinite-dimensional Hessian operator in RKHS. We introduce Policy Newton in RKHS, the first second-order optimization framework specifically designed for RL policies represented in RKHS. Our approach circumvents direct computation of the inverse Hessian operator by optimizing a cubic regularized auxiliary objective function. Crucially, we leverage the Representer Theorem to transform this infinite-dimensional optimization into an equivalent, computationally tractable finite-dimensional problem whose dimensionality scales with the trajectory data volume. We establish theoretical guarantees proving convergence to a local optimum with a local quadratic convergence rate. Empirical evaluations on a toy financial asset allocation problem validate these theoretical properties, while experiments on standard RL benchmarks demonstrate that Policy Newton in RKHS achieves superior convergence speed and higher episodic rewards compared to established first-order RKHS approaches and parametric second-order methods. Our work bridges a critical gap between non-parametric policy representations and second-order optimization methods in reinforcement learning.

Poster

P4-#4602

GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning

Silvia Sapora ⋅ R Devon Hjelm ⋅ Omar Attia ⋅ Alexander Toshev ⋅ Bogdan Mazoure

Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield black-box models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the MuJoCo, BabyAI and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to strong policies, compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.

Poster

P4-#4603

Learning to Answer from Correct Demonstrations

Nirmit Joshi ⋅ Gene Li ⋅ Siddharth Bhandari ⋅ Shiva Kasiviswanathan ⋅ Cong Ma ⋅ Nathan Srebro

We study the problem of learning to generate an answer (or completion) to a question (or prompt), where there could be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in Supervised Fine Tuning (SFT). We formalize the problem as imitation learning (i.e., apprenticeship learning) in contextual bandits, with offline demonstrations from some expert (optimal, or very good) policy, without explicitly observed rewards. In contrast to prior work, which assumes the demonstrator belongs to a bounded-complexity policy class, we propose relying only on the underlying reward model (i.e., specifying which answers are correct) being in a bounded-complexity class, which we argue is a strictly weaker assumption. We show that likelihood-maximization methods can fail in this setting, and instead present an approach that learns to answer nearly as well as the demonstrator, with sample complexity logarithmic in the cardinality of the reward class. Our method is similar to Syed and Schapire 2007, when adapted to a contextual bandit (i.e., single step) setup, but is a simple one-pass online approach that enjoys an ``optimistic rate'' (i.e., $1/\varepsilon$ when the demonstrator is optimal, versus $1/\varepsilon^2$ in Syed and Schapire 2007, and works even with arbitrarily adaptive demonstrations.

Poster

P4-#4604

Heterogeneous Agent Q-weighted Policy Optimization

Bor Jiun Lin ⋅ Chun-Yi Lee

Multi-agent reinforcement learning (MARL) confronts a fundamental tension between stability and expressiveness. Stability requires avoiding divergence under non-stationary updates, while expressiveness demands capturing multimodal strategies for heterogeneous coordination. Existing methods sacrifice one for the other: value-decomposition and trust-region approaches ensure stability but assume restrictive unimodal policies, while expressive generative models lack optimization guarantees. To address this challenge, we introduce Heterogeneous Agent Q-weighted Policy Optimization (HAQO), a framework unifying sequential advantage-aware updates, Q-weighted variational surrogates, and entropy regularization. Our analysis establishes monotone improvement guarantees under bounded critic bias, extending trust-region theory to diffusion-based policies with intractable log-likelihoods. HAQO achieves superior returns and reduced variance compared to policy-gradient baselines across diverse benchmarks. The ablation studies confirm sequential updates ensure stability, expressive policies enable multimodality, and entropy regularization prevents collapse. HAQO reconciles stability and expressiveness in MARL with theoretical rigor and practical effectiveness.

Poster

P4-#4605

MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow

Ziyue Wang ⋅ Junde Wu ⋅ Linghan Cai ⋅ Chang Low ⋅ Xihong Yang ⋅ Qiaxuan Li ⋅ Yueming Jin

Modern clinical diagnosis relies on the comprehensive analysis of multi-modal patient data, drawing on medical expertise to ensure systematic and rigorous reasoning. Recent advances in Vision–Language Models (VLMs) and agent-based methods are reshaping medical diagnosis by effectively integrating multi-modal information. However, they often output direct answers and empirical-driven conclusions without clinical evidence supported by quantitative analysis, which compromises their reliability and hinders clinical usability. Here we propose MedAgent-Pro, an agentic reasoning paradigm that mirrors modern diagnosis principles via a hierarchical diagnostic workflow, consisting of disease-level standardized plan generation and patient-level personalized step-by-step reasoning. To support disease-level planning, a retrieval-augmented generation agent is designed to access medical guidelines for alignment with clinical standards. For patient-level reasoning, MedAgent-Pro leverages professional tools such as visual models to take various actions to analyze multi-modal input, and performs evidence-based reflection to iteratively adjust memory, enforcing rigorous reasoning throughout the process. Extensive experiments across a wide range of anatomical regions, imaging modalities, and diseases demonstrate the superiority of MedAgent-Pro over mainstream VLMs, agentic systems and leading expert models. Ablation studies and expert evaluation further confirm its robustness and clinical relevance. Anonymized code link is available in the reproducibility statement.

Poster

P4-#4606

Language and Experience: A Computational Model of Social Learning in Complex Tasks

Cédric Colas ⋅ Tracey Mills ⋅ Ben Prystawski ⋅ Michael Tessler ⋅ Noah Goodman ⋅ Jacob Andreas ⋅ Joshua B Tenenbaum

The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments. How do people integrate these two sources of knowledge, and how might AI systems? We present a computational framework that models human social learning as joint probabilistic inference over structured, executable world models given sensorimotor and linguistic data. We make this possible by turning a pretrained language model into a probabilistic model of how humans share advice conditioned on their beliefs, allowing our agents both to generate advice for others and to interpret linguistic input as evidence during Bayesian inference. Using behavioral experiments and simulations across 10 video games, we show how linguistic guidance can shape exploration and accelerate learning by reducing risky interactions and speeding up key discoveries in both humans and models. We further explore how knowledge can accumulate across generations through iterated learning experiments and demonstrate successful knowledge transfer between humans and models—revealing how structured, language-compatible representations might facilitate human-machine collaborative learning.

Poster

P4-#4607

From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning

Cheng Yang ⋅ Jiaxuan Lu ⋅ Haiyuan Wan ⋅ Junchi Yu ⋅ Feiwei Qin

The chemical reaction recommendation is to select proper reaction condition parameters for chemical reactions, which is pivotal to accelerating chemical science.With the rapid development of large language models (LLMs), there is growing interest in leveraging their reasoning and planning capabilities for reaction condition recommendation.Despite their success, existing methods rarely explain the rationale behind the recommended reaction conditions, limiting their utility in high-stakes scientific workflows. In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task. ChemMAS decomposes the task into mechanistic grounding, multi-channel recall, constraint-aware agentic debate, and rationale aggregation. Each decision is backed by interpretable justifications grounded in chemical knowledge and retrieved precedents. Experiments show that ChemMAS achieves 20–35\% gains over domain-specific baselines and outperforms general-purpose LLMs by 10–15\% in Top-1 similarity, while offering falsifiable, human-trustable rationales, which establishes a new paradigm for explainable AI in scientific discovery.

Poster

P4-#4608

MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

Huining Yuan ⋅ Zelai Xu ⋅ Zheyue Tan ⋅ Xiangmin Yi ⋅ Mo Guang ⋅ Kaiwen Long ⋅ Haojia Hui ⋅ BOXUN LI ⋅ Xinlei Chen ⋅ Bo Zhao ⋅ Xiao-Ping Zhang ⋅ Chao Yu ⋅ Yu Wang

Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce **MARSHAL**, an end-to-end RL framework that incentivizes **M**ulti-**A**gent **R**easoning through **S**elf-play wit**H** str**A**tegic **L**LMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with up to $28.7$\% performance improvements in held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to $10.0$\% on AIME, $7.6$\% on GPQA-Diamond, and $3.5$\% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.

Poster

P4-#4609

Matching Multiple Experts: On the Exploitability of Multi-Agent Imitation Learning

Antoine Bergerault ⋅ Volkan Cevher ⋅ Negar Mehr

Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent interactive domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned polices are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results of learning low-exploitable policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and demonstrating a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for the case of dominant strategy expert equilibria, assuming Behavioral Cloning error $\epsilon_{\text{BC}}$, this provides a Nash imitation gap of $\mathcal{O}\left(n\epsilon_{\text{BC}}/(1-\gamma)^2\right)$ for a discount factor $\gamma$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.

Poster

P4-#4610

GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent System

Yiqin Yang ⋅ Xu Yang ⋅ Yuhua Jiang ⋅ Ni Mu ⋅ Hao Hu ⋅ Runpeng Xie ⋅ Ziyou Zhang ⋅ Siyuan Li ⋅ Yuan-Hua Ni ⋅ Qianchuan Zhao ⋅ Bo XU

In the realm of multi-agent systems, the challenge of partial observability is a critical barrier to effective coordination and decision-making. Existing approaches, such as belief state estimation and inter-agent communication, often fall short. Belief-based methods are limited by their focus on past experiences without fully leveraging global information, while communication methods often lack a robust model to effectively utilize the auxiliary information they provide. To solve this issue, we propose Global State Diffusion Algorithm to infer the global state based on the local observations. By formulating the state inference process as a multi-modal diffusion process, GlobeDiff overcomes ambiguities in state estimation while simultaneously inferring the global state with high fidelity. We prove that the estimation error of GlobeDiff under both unimodal and multi-modal distributions can be bounded. Extensive experimental results demonstrate that GlobeDiff achieves superior performance and is capable of accurately inferring the global state.

Poster

P4-#4611

Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

Chengzhi Liu ⋅ Yuzhe YANG ⋅ Kaiwen Zhou ⋅ Zhen Zhang ⋅ Yue Fan ⋅ Yanan Xie ⋅ Peng Qi ⋅ Xin Wang

The promotion of academic papers has become an important means of enhancing research visibility. where the appeal of dissemination largely determines its effectiveness. However, existing automated methods struggle limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: there is no way to improve it when you cannot evaluate it right. To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) High-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction. (ii) Automated generation pipelines exhibit a trade-off between visual design and content construction. (iii) Multi-task RL training shows stronger generalization in aesthetic awareness tasks.

Poster

P4-#4612

Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

Naoki Shitanda ⋅ Motoki Omura ⋅ Tatsuya Harada ⋅ Takayuki Osa

Scaling reinforcement learning to tens of thousands of parallel environments requires overcoming the limited exploration capacity of a single policy. Ensemble-based policy gradient methods, which employ multiple policies to collect diverse samples, have recently been proposed to promote exploration. However, merely broadening the exploration space does not always enhance learning capability, since excessive exploration can reduce exploration quality or compromise training stability. In this work, we theoretically analyze the impact of inter-policy diversity on learning efficiency in policy ensembles, and propose Coupled Policy Optimization which regulates diversity through KL constraints between policies. The proposed method enables effective exploration and outperforms strong baselines such as SAPG, PBT, and PPO across multiple tasks, including challenging dexterous manipulation, in terms of both sample efficiency and final performance. Furthermore, analysis of policy diversity and effective sample size during training reveals that follower policies naturally distribute around the leader, demonstrating the emergence of structured and efficient exploratory behavior. Our results indicate that diverse exploration under appropriate regulation is key to achieving stable and sample-efficient learning in ensemble policy gradient methods. Project page at https://naoki04.github.io/paper-cpo/ .

Poster

P4-#4613

RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling

Yang Liu ⋅ Jiaqi Li ⋅ Zilong Zheng

Rule-based reasoning is acknowledged as one of the fundamental problems of reasoning. While recent studies show that large reasoning models (LRMs) have remarkable reasoning capabilities enhanced by reinforcement learning (RL), real applications still face severe challenges due to variations in rule formats, types, and complexity. To mitigate this issue, we introduce RuleReasoner, an effective method for rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach in RL. Specifically, RuleReasoner resamples each training batch by updating the domain weights based on historical rewards. This facilitates domain balance and active learning schedules for RL, obviating static mix-training engineered by humans. Evaluations of in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin ($\Delta$4.1% on eight ID tasks and $\Delta$10.4% on three OOD benchmarks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior methods.

Poster

P4-#4614

EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems

Yufei He ⋅ Juncheng Liu ⋅ Yue Liu ⋅ Yibo Li ⋅ Tri Cao ⋅ Zhiyuan Hu ⋅ Xinxing Xu ⋅ Bryan Hooi

A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like “clever but clueless interns” in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients—by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state–action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.

Poster

P4-#4615

ATGen: Adversarial Reinforcement Learning for Test Case Generation

Qingyao Li ⋅ Xinyi Dai ⋅ Weiwen Liu ⋅ Xiangyang Li ⋅ Yasheng Wang ⋅ Ruiming Tang ⋅ Yong Yu ⋅ Weinan Zhang

Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs, for which effective test cases are a critical bottleneck. Existing test generation methods, whether based on prompting or supervised fine-tuning, rely on static datasets. This imposes a “fixed-difficulty ceiling”, fundamentally limiting their ability to uncover novel or more complex bugs beyond their training scope. To overcome this, we introduce ATGEN, a framework that trains a test case generator via adversarial reinforcement learning. ATGEN pits a test generator against an adversarial code generator that continuously crafts harder bugs to evade the current policy. This dynamic loop creates a curriculum of increasing difficulty that continuously challenges the current policy. The test generator is optimized via Reinforcement Learning (RL) to jointly maximize “Output Accuracy” and “Attack Success”, enabling it to learn a progressively stronger policy that breaks the fixed-difficulty ceiling of static training. Extensive experiments demonstrate that ATGEN significantly outperforms state-of-the-art baselines. We further validate its practical utility, showing it serves as both a more effective filter for Best-of-N inference and a higher-quality reward source for training code generation models. Our work establishes a new, dynamic paradigm for improving the reliability of LLM-generated code.

Poster

P4-#4616

LogicReward: Incentivizing LLM Reasoning via Step-Wise Logical Supervision

Jundong Xu ⋅ Hao (Scofield) Fei ⋅ Huichi Zhou ⋅ Xin Quan ⋅ Qijun Huang ⋅ Shengqiong Wu ⋅ William Wang ⋅ Mong-Li Lee ⋅ Wynne Hsu

Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6\% and 2\% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. The code and data are available at https://llm-symbol.github.io/LogicReward.

Poster

P4-#4617

GEM: A Gym for Generalist LLMs

Zichen Liu ⋅ Anya Sims ⋅ Keyu Duan ⋅ Changyu Chen ⋅ Simon Yu ⋅ Xiangxin Zhou ⋅ Haotian Xu ⋅ Shaopan Xiong ⋅ Bo Liu ⋅ Chenmien Tan ⋅ Weixun Wang ⋅ Hao Zhu ⋅ Weiyan Shi ⋅ Diyi Yang ⋅ Michael Qizhe Shieh ⋅ Yee Whye Teh ⋅ Wee Sun Lee ⋅ Min Lin

The training paradigm for large language models (LLMs) is moving from static datasets to experience-based learning, where agents acquire skills via interacting with complex environments. To facilitate this transition we introduce GEM (General Experience Maker), an open-source environment simulator designed for the age of LLMs. Analogous to OpenAI-Gym for traditional reinforcement learning (RL), GEM provides a standardized framework for the environment-agent interface, including asynchronous vectorized execution for high throughput, and flexible wrappers for easy extensibility. GEM also features a diverse suite of environments, robust integrated tools, and single-file example scripts demonstrating using GEM with five popular RL training frameworks. Along with this, we also provide a set of baselines across 24 environments using REINFORCE with Return Batch Normalization (ReBN), which---unlike GRPO---is compatible with the full RL setting of dense per-turn rewards and arbitrary discount factors. We further conduct apple-to-apple benchmarking of PPO, GRPO and REINFORCE in both single- and multi-turn settings using GEM to shed light on the algorithmic designs. GEM also functions as a convenient evaluation toolkit besides a training environment. We hope this framework can help accelerate future agentic LLM research.

Poster

P4-#4618

BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

Qianli Shen ⋅ Daoyuan Chen ⋅ Yilun Huang ⋅ Zhenqing Ling ⋅ Yaliang Li ⋅ Bolin Ding ⋅ Jingren Zhou

Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce \textbf{BOTS}, a unified framework for \textbf{B}ayesian \textbf{O}nline \textbf{T}ask \textbf{S}election in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates \emph{explicit evidence} from direct evaluations of selected tasks and \emph{implicit evidence} inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation for task selection. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.

Poster

P4-#4718

Geometric-Mean Policy Optimization

Yuzhong Zhao ⋅ Yue Liu ⋅ Junpeng Liu ⋅ Jingye Chen ⋅ xun wu ⋅ Yaru Hao ⋅ Tengchao Lv ⋅ Shaohan Huang ⋅ Lei Cui ⋅ Qixiang Ye ⋅ Fang Wan ⋅ Furu Wei

Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), with the aim to improve the stability of GRPO through suppressing token reward outliers. GMPO is plug-and-play—simply replacing GRPO's arithmetic mean with the geometric mean of token-level rewards, as the latter is inherently less sensitive to outliers. GMPO is theoretically plausible—analysis reveals that both GMPO and GRPO are weighted forms of the policy gradient while the former enjoys more stable weights, which consequently benefits policy optimization and performance. Experiments on multiple mathematical reasoning benchmarks show that \Ours-7B improves the average Pass@1 of GRPO by up to 4.1%, outperforming many state-of-the-art approaches. Code is available at https://github.com/callsys/GMPO and verl.

Poster

P4-#4717

KL-Regularized Reinforcement Learning for Generative Modelling is Designed to Mode Collapse

Anthony GX-Chen ⋅ Jatin Prakash ⋅ Jeff Guo ⋅ Rob Fergus ⋅ Rajesh Ranganath

Classical intuitions cast minimizing reverse KL as "mode seeking" and forward KL as "mass covering". In KL-regularized reinforcement learning, however, the regularizer determines both the target distribution's shape and the divergence being implicitly minimized, making its role more nuanced than simply inducing generic mode-seeking or mass-covering behaviour. Specifically, the target distribution is defined jointly by the reward function, the reference model, the type of regularizer, and the regularization strength. We show that under common settings—such as low regularization strength and equal verifiable rewards—both forward and reverse KL regularization tend to specify target distributions whose mass concentrates on a single high-reward region. Thus, the objective itself by construction induces diversity collapse, regardless of the policy optimization algorithm used. Building on this perspective, we introduce a simple and scalable modification that rescales rewards to induce target distributions assigning substantial probability across all high-reward regions. This yields a principled objective that maintains high solution quality while achieving broad reward-mode coverage. Empirically, this approach improves post-training diversity and performance for Large Language Models and Chemical Language Models, and is effective with either forward or reverse KL regularization, while using either naively fails.

Poster

P4-#4716

CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs

Yongcheng Zeng ⋅ Zexu Sun ⋅ Bokai Ji ⋅ Erxue Min ⋅ Hengyi Cai ⋅ Shuaiqiang Wang ⋅ Dawei Yin ⋅ Haifeng Zhang ⋅ Xu Chen ⋅ Jun Wang

Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that our CurES outperforms Group Relative Policy Optimization (GRPO) by $\textbf{+3.30}$ points and $\textbf{+4.82}$ points with 1.5B and 7B models, respectively, and exceeds the best prior sample efficient methods by $\textbf{+2.12}$ points on average across eight math reasoning benchmarks. Additionally, CurES exhibits faster convergence compared to baselines, including GRPO.

Poster

P4-#4715

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

Donghao Li ⋅ Chengshuai Shi ⋅ Weijuan Ou ⋅ Cong Shen ⋅ Jing Yang

Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection - efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.

Poster

P4-#4714

Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards

Xuan Zhang ⋅ Ruixiao Li ⋅ Zhijian Zhou ⋅ Long Li ⋅ Yulei Qin ⋅ Ke Li ⋅ Xing Sun ⋅ Xiaoyu Tan ⋅ Chao Qu ⋅ Yuan Qi

Reinforcement Learning (RL) has become a compelling way to strengthen the multi step reasoning ability of Large Language Models (LLMs). However, prevalent RL paradigms still lean on sparse outcome-based rewards and limited exploration, which often drives LLMs toward repetitive and suboptimal reasoning patterns. In this paper, we study the central question of how to design exploration for LLM reasoning and introduce MERCI (Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards), a novel RL algorithm that augments policy optimization with a principled intrinsic reward. Building on the idea of count-based exploration, MERCI leverages a lightweight Coin Flipping Network (CFN) to estimate the pseudo count and further epistemic uncertainty over reasoning trajectories, and converts them into an intrinsic reward that values novelty while preserving the learning signal from task rewards. We integrate MERCI into some advanced RL frameworks like Group Relative Policy Optimization (GRPO). Experiments on complex reasoning benchmarks demonstrate that MERCI encourages richer and more varied chains of thought, significantly improves performance over strong baselines, and helps the policy escape local routines to discover better solutions. It indicates that our targeted intrinsic motivation can make exploration reliable for language model reasoning.

Poster

P4-#4713

Sparse but Critical: A Token-Level Analysis of Distributional Shifts in RLVR Fine-Tuning of LLMs

Haoming Meng ⋅ Kexin Huang ⋅ Shaohang Wei ⋅ Chiyu Ma ⋅ Shuo Yang ⋅ xue wang ⋅ Guoyin Wang ⋅ Bolin Ding ⋅ Jingren Zhou

Reinforcement learning with verifiable rewards (RLVR) has significantly improved reasoning in large language models (LLMs), yet the token-level mechanisms underlying these improvements remain unclear. We present a systematic empirical study of RLVR’s distributional effects organized around three main analyses: (1) token-level characterization of distributional shifts between base and RL models, (2) the impact of token-level distributional shifts on reasoning performance through cross-sampling interventions, and (3) fine-grained mechanics of these shifts at the token level. We find that RL fine-tuning induces highly sparse and targeted changes, with only a small fraction of token distributions exhibiting meaningful divergence. We further characterize the structure of these shifts through analyses of token entropy, positional concentration, and reallocation of probability mass. To assess the functional importance of these sparse changes, we conduct cross-sampling experiments that selectively swap token choices between the base and RL models. Inserting only a small fraction of RL-sampled tokens into base generations progressively recovers RL performance gains, while injecting a similarly small number of base token choices into RL-generated responses collapses performance to base levels, isolating a sparse set of token-level decisions directly responsible for RLVR’s improvements. Finally, we explore divergence-weighted variants of the advantage signal as a diagnostic intervention, finding that they can yield improvements over baselines. Together, our results shed light on the distributional changes induced by RLVR and provide a fine-grained, token-level lens for understanding RLVR as a targeted refinement process.

Blog Track Poster

P4-#4712

JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

Bingxiang He ⋅ Zekai Qu ⋅ Zeyuan Liu ⋅ Yinghao Chen ⋅ Yuxin Zuo ⋅ Cheng Qian ⋅ Kaiyan Zhang ⋅ Weize Chen ⋅ Chaojun Xiao ⋅ Ganqu Cui ⋅ Ning Ding ⋅ Zhiyuan Liu

Training small reasoning models with RL has become a race toward complexity, using multi-stage pipelines, dynamic schedules, and curriculum learning. We ask whether this complexity necessary? We show that JustRL, a simple recipe with fixed hyperparameters, achieves state-of-the-art performance on two different 1.5B base models (54.5% and 64.3% across 9 math benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training remains stable over thousands of steps without intervention. This suggests the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline.

Poster

P4-#4711

OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

Zhiting Fan ⋅ Ruizhe Chen ⋅ Tianxiang Hu ⋅ Ru Peng ⋅ Zenan Huang ⋅ Haokai Xu ⋅ Yixin Chen ⋅ JIAN Wu ⋅ Junbo Zhao ⋅ Zuozhu Liu

Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data that imparts problem-solving capabilities. However, as applications expand, high-quality SFT data in knowledge-intensive verticals (e.g., humanities and social sciences, medicine, law, finance) is exceedingly scarce: expert curation is costly, privacy constraints are strict, and label consistency is hard to guarantee. Recent work turns to synthetic data, typically prompting a teacher model over domain documents and filtering with handcrafted rubrics. Yet, rubric design is expert-dependent and rarely transfers across domains; moreover, prevalent heuristic optimization follows a brittle loop (write rubric $\rightarrow$ synthesize $\rightarrow$ train $\rightarrow$ inspect $\rightarrow$ guess tweaks) that lacks reliable, quantitative feedback about a rubric's true contribution to downstream performance. We argue for assessing synthetic data quality through its causal impact on the target model, using this feedback to guide data generation. Inspired by classic influence functions, we repurpose an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to the objective of a given target model on specific tasks. Our analysis reveals a gap: although synthetic and real samples may be close in embedding space, their influence on learning can differ substantially. Building on this insight, we propose an optimization-based synthetic data framework that adapts rubrics with target-model feedback. Instead of manually engineering domain rubrics, we supply lightweight guiding text and delegate rubric generation to a rubric-specialized model conditioned on the task; crucially, rubric (and data) selection is supervised by estimated downstream impact rather than proxy formality. Empirically, the framework yields consistent gains across domains (HSS and health), target models (e.g., Qwen and Llama families), and data generators, demonstrating broad generalization and engineering portability without task-specific tuning.

Poster

P4-#4710

Is On-Policy Data always the Best Choice for Direct Preference Optimization-Based LM Alignment?

Zetian Sun ⋅ dongfang li ⋅ Xuhui Chen ⋅ Baotian Hu ⋅ Min Zhang

The alignment of language models (LMs) with human preferences is critical for building reliable AI systems. The problem is typically framed as optimizing an LM policy to maximize the expected reward that reflects human preferences. Recently, Direct Preference Optimization (DPO) was proposed as an LM alignment method that directly optimizes the policy from static preference data, and further improved by incorporating on-policy sampling (i.e., preference candidates generated during the training loop) for better LM alignment. However, we show on-policy data is not always optimal, with systematic effectiveness difference emerging between static and on-policy preference candidates. For example, on-policy data can result in a $3\times$ effectiveness compared with static data for Llama-3, and a $0.4\times$ effectiveness for Zephyr. To explain the phenomenon, we propose the alignment stage assumption, which divides the alignment process into two distinct stages: the preference injection stage, which benefits from diverse data, and the preference fine-tuning stage, which favors high-quality data. Through theoretical and empirical analysis, we characterize these stages and propose an effective algorithm to identify the boundaries between them. We perform experiments on $5$ models (Llama, Zephyr, Phi-2, Qwen, Pythia) and $2$ alignment methods (DPO, SLiC-HF) to show the generalizability of alignment stage assumption and boundary measurement.

Poster

P4-#4709

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh ⋅ Majid Daliri ⋅ Majid Hadian ⋅ Vahab Mirrokni

Vector quantization, a problem rooted in Shannon's source coding theory, aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure. We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion, overcoming limitations of existing methods that fail to achieve optimal distortion rates. Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates (within a small constant factor) across all bit-widths and dimensions. TurboQuant achieves this by randomly rotating input vectors, inducing a concentrated Beta distribution on coordinates, and leveraging the near-independence property of distinct coordinates in high dimensions to simply apply optimal scalar quantizers per each coordinate. Recognizing that MSE-optimal quantizers introduce bias in inner product estimation, we propose a two-stage approach: applying an MSE quantizer followed by a 1-bit Quantized JL (QJL) transform on the residual, resulting in an unbiased inner product quantizer. We also provide a formal proof of the information-theoretic lower bounds on best achievable distortion rate by any vector quantizer, demonstrating that TurboQuant closely matches these bounds, differing only by a small constant ($\approx 2.7$) factor. Experimental results validate our theoretical findings, showing that for KV cache quantization, we achieve absolute quality neutrality with 3.5 bits per channel and marginal quality degradation with 2.5 bits per channel. Furthermore, in nearest neighbor search tasks, our method outperforms existing product quantization techniques in recall while reducing indexing time to virtually zero.

Poster

P4-#4708

OR-PRM: A Process Reward Model for Algorithmic Problem in Operations Research

Yilin Wang ⋅ Heng Zhou ⋅ Dongxing Mao ⋅ Linjie Li ⋅ Jingru Tan ⋅ Haochen Han ⋅ Zhengyuan Yang ⋅ Alex Jinpeng Wang ⋅ Min Li

Large language models (LLMs) with Process Reward Models (PRMs) have shown strong reasoning ability, yet their potential in Operations Research (OR) remains unexplored. We present the first PRM tailored for OR, but find that directly training on mainstream datasets yields surprisingly weak performance. To understand this gap, we conduct a systematic analysis and identify the primary bottleneck: the datasets themselves, where over 30\% of annotations are severely flawed. To overcome these limitations, we first collect all existing synthetic datasets and apply a carefully designed filtering pipeline to construct a high-quality seed dataset. Building upon this seed, we then build OR-ProcessQA, the first large-scale dataset for OR with step-by-step supervision, where diverse solution pathways are generated via Monte Carlo Tree Search (MCTS) and each step is validated for logical consistency by GPT-4o. Building on this foundation, we train OR-PRM, the first Process Reward Model in the OR domain, designed to evaluate and guide reasoning at every step rather than only the final outcome. Together, these advances enable OR-PRM to substantially improve LLMs’ reasoning capability, achieving a maximum absolute improvement of 12.5\% over the base model in Best-of-N settings, and highlighting the power of process-oriented supervision for reliable problem solving in operations research.

Poster

P4-#4707

What Happens Next? Anticipating Future Motion by Generating Point Trajectories

Gabrijel Boduljak ⋅ Laurynas Karazija ⋅ Iro Laina ⋅ Christian Rupprecht ⋅ Andrea Vedaldi

We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without the ability to observe other parameters such as the object velocities or the forces applied to them. We formulate this task as conditional generation of dense trajectory grids with a model that closely follows the architecture of modern video generators but outputs motion trajectories instead of pixels. This approach captures scene-wide dynamics and uncertainty, yielding more accurate and diverse predictions than prior regressors and generators. Although recent state-of-the-art video generators are often regarded as world models, we show that they struggle with forecasting motion from a single image, even in simple physical scenarios such as falling blocks or mechanical object interactions, despite fine-tuning on such data. We show that this limitation arises from the overhead of generating pixels rather than directly modeling motion.

Poster

P4-#4706

IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

昊李 ⋅ Zhengyu Zou ⋅ Fangfu Liu ⋅ zhang xuanyang ⋅ Fangzhou Hong ⋅ Yukang Cao ⋅ Yushi LAN ⋅ Manyuan Zhang ⋅ Gang Yu ⋅ Dingwen Zhang ⋅ Ziwei Liu

Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline. Unlike previous methods that bound with a specific language model, we introduce an Instance-Grounded Scene Understanding paradigm, where instance masks serve as the bridge connecting our unified representation with diverse Visual Language Models (VLMs) in a plug-and-play manner, substantially expanding downstream understanding capabilities. Extensive experiments on instance multi-view instance matching, open-vocabulary segmentation, and QA scene grounding demonstrate that IGGT outperforms state-of-the-art methods in both quality and consistency for semantic 3D reconstruction.

Poster

P4-#4705

To Infinity and Beyond: Tool-Use Unlocks Length Generalization in State Space Models

Eran Malach ⋅ Omid Saremi ⋅ Sinead Williamson ⋅ Arwen Bradley ⋅ Aryo Lotfi ⋅ Emmanuel Abbe ⋅ Joshua Susskind ⋅ Etai Littwin

State Space Models (SSMs) have become the leading alternative to Transformers for sequence modeling tasks. Their primary advantage is efficiency in long-context and long-form generation, enabled by fixed-size memory and linear scaling of computational complexity. We begin this work by showing a simple theoretical result stating that SSMs cannot accurately solve any "truly long-form" generation problem (in a sense we formally define), undermining their main competitive advantage. However, we show that this limitation can be mitigated by allowing SSMs interactive access to external tools. In fact, we show that given the right choice of tool access and problem-dependent training data, SSMs can learn to solve any tractable problem and generalize to arbitrary problem length/complexity (i.e., achieve length generalization). Following our theoretical finding, we demonstrate that tool-augmented SSMs achieve remarkable length generalization on a variety of arithmetic, reasoning, and coding tasks. These findings highlight SSMs as a potential efficient alternative to Transformers in interactive tool-based and agentic settings.

Poster

P4-#4704

Vision-Zero: Scalable VLM Self-Evolution via Multi-Agent Self-Play

Qinsi Wang ⋅ Bo Liu ⋅ Tianyi Zhou ⋅ Jing Shi ⋅ Yueqian Lin ⋅ Yiran Chen ⋅ Hai Li ⋅ Kun Wan ⋅ Wentian Zhao

Although reinforcement learning (RL) can effectively enhance the reasoning capabilities of vision–language models (VLMs), current methods remain heavily dependent on labor-intensive datasets that require extensive manual construction and verification, leading to extremely high training costs and consequently constraining the practical deployment of VLMs. To address this challenge, we propose Vision-Zero, a domain-agnostic self-play framework that generates visual deduction games from diverse images for scalable VLM training without human annotations. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model’s reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code will be released upon acceptance.

Poster

P4-#4703

$PhyWorldBench$: A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

Jing Gu ⋅ Xian Liu ⋅ Yu Zeng ⋅ Ashwin Nagarajan ⋅ Fangrui Zhu ⋅ Daniel Hong ⋅ Yue Fan ⋅ Qianqi Yan ⋅ Kaiwen Zhou ⋅ Ming-Yu Liu ⋅ Xin Wang

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents $PhyWorldBench$ , a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles like object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel "Anti-Physics" category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that could utilize current MLLM to evaluate the physics realism in a zero-shot fashion. We evaluate 10 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with a detailed comparison and analysis. we identify pivotal challenges models face in adhering to real-world physics. Through systematic testing of their outputs across 1,050 curated prompts—spanning fundamental, composite, and anti-physics scenarios—we identify pivotal challenges these models face in adhering to real-world physics. We then rigorously examine their performance on diverse physical phenomena with varying prompt types, deriving targeted recommendations for crafting prompts that enhance fidelity to physical principles.

Poster

P4-#4702

OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment

Liang Lin ⋅ Zhihao Xu ⋅ Junhao Dong ⋅ Jian Zhao ⋅ Yuchen Yuan ⋅ Guibin Zhang ⋅ Miao Yu ⋅ Yiming Zhang ⋅ Zhengtao Yao ⋅ Huahui Yi ⋅ HAICHUAN TANG ⋅ Dongrui Liu ⋅ Xinfeng Li ⋅ Kun Wang

Large language model (LLM) alignment faces a critical dilemma when addressing multiple human preferences: improvements in one dimension frequently come at the expense of others, creating unavoidable trade-offs between competing objectives like helpfulness and harmlessness. While prior work mainly focuses on constraint-based optimization algorithms and data selection strategies to mitigate conflicts, these approaches overlook the fundamental issue of resolving conflicts directly at the parameter level. In this paper, we present OrthAlign, an innovative approach that pioneers a new paradigm by leveraging orthogonal subspace decomposition to fundamentally resolve gradient-level conflicts in multi-objective preference alignment. OrthAlign strategically decomposes parameter update spaces into orthogonal subspaces, ensuring that optimization toward different preferences occurs in mathematically non-interfering directions. Building upon this, we provide theoretical guarantees demonstrating that when parameter increments satisfy both orthogonal subspace constraints and spectral norm bounds, the resulting updates exhibit linear Lipschitz growth rather than exponential instability, ensuring stable convergence across all preference dimensions. Extensive experiments show that: I. OrthAlign achieves maximum single-preference improvements ranging from 34.61% to 50.89% after multiple-objective alignment across helpful, harmless, and truthful dimensions. II. With an average overall reward improvement of 13.96%. Our code is available at https://anonymous.4open.science/r/OrthAlign.

Poster

P4-#4701

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

Chengzhi Mao ⋅ Xudong Lin ⋅ Wen-Sheng Chu

Vision foundation models are typically trained as static feature extractors, forcing the burden of task adaptation onto large downstream models. We propose a different paradigm: instead of solely feeding visual features into language, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time—without requiring task-specific retraining. This enables the encoder to focus attention on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), outperforms vision–language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks---offering a direct path toward adaptive, instruction-driven visual intelligence.

Poster

P4-#4801

Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning

Weipu Zhang ⋅ Adam Jelley ⋅ Trevor McInroe ⋅ Amos Storkey ⋅ Gang Wang

While deep reinforcement learning (RL) from pixels has achieved remarkable success, its sample inefficiency remains a critical limitation for real-world applications. Model-based RL (MBRL) addresses this by learning a world model to generate simulated experience, but standard approaches that rely on pixel-level reconstruction losses often fail to capture small, task-critical objects in complex, dynamic scenes. We posit that an object-centric (OC) representation can direct model capacity toward semantically meaningful entities, improving dynamics prediction and sample efficiency. In this work, we introduce OC-STORM, an object-centric MBRL framework that enhances a learned world model with object representations extracted by a pretrained segmentation network. By conditioning on a minimal number of annotated frames, OC-STORM learns to track decision-relevant object dynamics and inter-object interactions without extensive labeling or access to privileged information. Empirical results demonstrate that OC-STORM significantly outperforms the STORM baseline on the Atari 100k benchmark and achieves state-of-the-art sample efficiency on challenging boss fights in the visually complex game Hollow Knight. Our findings underscore the potential of integrating OC priors into MBRL for complex visual domains. Project page: https://oc-storm.weipuzhang.com

Poster

P4-#4802

Emergent Misalignment is Easy, Narrow Misalignment is Hard

Anna Soligo ⋅ Edward Turner ⋅ Senthooran Rajamanoharan ⋅ Neel Nanda

Finetuning large language models on narrowly harmful datasets can cause them to become emergently misaligned, giving stereotypically `evil' responses across diverse unrelated settings. Concerningly, a pre-registered survey of experts failed to predict this result, highlighting our poor understanding of the inductive biases governing learning and generalisation in LLMs. We use emergent misalignment (EM) as a case study to investigate these inductive biases, and find that although models can learn the narrow dataset task, the general solution is measurably more stable and more efficient. To establish this, we first demonstrate that EM is a robust phenomena by introducing new datasets which induce misalignment more consistently and coherently than prior work. We show that different EM finetunes converge to the same linear representation of general misalignment, which can be used to mediate misaligned behaviour. However, a linear representation of the narrow solution also exists, and can be learned by introducing a KL divergence loss. Comparing these representations reveals that general misalignment achieves lower loss, is more robust to perturbations, and is more influential in the pre-training distribution. This work isolates a concrete representation of general misalignment for monitoring and mitigation. More broadly, it offers a detailed case study and metrics for understanding how inductive biases shape generalisation in LLMs.

Poster

P4-#4803

Reinforcing General Reasoning Without Verifiers

Xiangxin Zhou ⋅ Zichen Liu ⋅ Anya Sims ⋅ Haonan Wang ⋅ Tianyu Pang ⋅ Chongxuan Li ⋅ Liang Wang ⋅ Min Lin ⋅ Chao Du

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead directly maximizes the probability of generating the reference answer, derived in a principled way from the RL objective. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks.

Poster

P4-#4804

Context Tokens are Anchors: Understanding the Repeat Curse in dMLLMs from an Information Flow Perspective

Qiyan Zhao ⋅ Xiaofeng Zhang ⋅ Shuochen Chang ⋅ Qianyu Chen ⋅ Xiaosong Yuan ⋅ Xuhang Chen ⋅ LUOQI LIU ⋅ Jiajun Zhang ⋅ Xu-yao Zhang ⋅ Da-Han Wang

Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of cache mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the Repeat Curse. To better investigate underlying mechanism behind this issue, we analyze repetition generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model’s growing prediction certainty; (3) Repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present CoTA, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code is available at https://github.com/ErikZ719/CoTA

Poster

P4-#4805

PHAT: Modeling Period Heterogeneity for Multivariate Time Series Forecasting

Jiaming Ma ⋅ Qihe Huang ⋅ Haofeng Ma ⋅ Guanjun Wang ⋅ Sheng Huang ⋅ Zhengyang Zhou ⋅ Pengkun Wang ⋅ Xu Wang ⋅ Binwu Wang ⋅ Yang Wang

While existing multivariate time series forecasting models have advanced significantly in modeling periodicity, they largely neglect the periodic heterogeneity common in real-world data, where variables exhibit distinct and dynamically changing periods. To effectively capture this periodic heterogeneity, we propose PHAT (Period Heterogeneity-Aware Transformer). Specifically, PHAT arranges multivariate inputs into a three-dimensional "periodic bucket" tensor, where the dimensions correspond to variable group characteristics with similar periodicity, time steps aligned by phase, and offsets within the period. By restricting interactions within buckets and masking cross-bucket connections, PHAT effectively avoids interference from inconsistent periods. We also propose a positive-negative attention mechanism, which captures periodic dependencies from two perspectives: periodic alignment and periodic deviation. Additionally, the periodic alignment attention scores are decomposed into positive and negative components, with a modulation term encoding periodic priors. This modulation constrains the attention mechanism to more faithfully reflect the underlying periodic trends. A mathematical explanation is provided to support this property. We evaluate PHAT comprehensively on 14 real-world datasets against 18 baselines, and the results show that it significantly outperforms existing methods, achieving highly competitive forecasting performance. Our sources is available at GitHub.

Poster

P4-#4806

AlphaAgentEvo: Evolution-Oriented Alpha Mining via Self-Evolving Agentic Reinforcement Learning

Ziyi Tang ⋅ Xuexiong Yin ⋅ Weixing Chen ⋅ Zechuan Chen ⋅ Yongsen Zheng ⋅ Wenxuan Ye ⋅ Keze Wang ⋅ Liang Lin

Alpha mining seeks to identify predictive alpha factors that generate excess returns relative to the market from a vast and noisy search space; however, existing evolution-based approaches struggle to facilitate the systematic evolution of alphas. Traditional methods, such as Genetic Programming (GP), cannot interpret natural language instructions and often fail to extract valuable insights from unsuccessful attempts, leading to low interpretability and inefficient exploration. Analogously, without mechanisms for systematic evolution, e.g., long-term planning and reflection, existing multi-agent approaches may easily fall into repetitive evolutionary routines, resulting in inefficient evolution. To overcome these limitations, we introduce AlphaAgentEvo, a self-evolving Agentic Reinforcement Learning (ARL) framework for alpha mining, which moves alpha mining beyond the brittle search-backtest-restart cycle toward a continuous trajectory of evolution. Guided by a hierarchical reward function, our agent engages in self-exploration of the search space, progressively learning basic requirements (e.g., valid tool calls) and then harder objectives (e.g., continuous performance improvements). Through this process, the agent acquires advanced behaviors such as long-horizon planning and reflective reasoning, which enable it to actively react to the underlying state (e.g., market regime shifts) and realize a self-evolving agent, marking a step toward more principled and scalable alpha mining. Extensive experiments demonstrate that AlphaAgentEvo achieves more efficient alpha evolution and generates diverse and transferable alphas, consistently surpassing a wide range of baselines. Notably, with only 4B parameters, it outperforms LLM-driven evolution methods configured with state-of-the-art closed-source reasoning models, highlighting the promise of ARL for next-generation alpha mining.

Poster

P4-#4807

IF-VidCap: Can Video Caption Models Follow Instructions?

Shihao Li ⋅ Yuanxing Zhang ⋅ Jiangtao Wu ⋅ Zhide Lei ⋅ Yiwen He ⋅ Runzhe Wen ⋅ Chenxi Liao ⋅ Chengkang Jiang ⋅ An Ping ⋅ Shuo Gao ⋅ Suhan Wang ⋅ Zhaozhou Bian ⋅ Zijun Zhou ⋅ Jingyi Xie ⋅ Jiayi Zhou ⋅ Jing Wang ⋅ Yifan Yao ⋅ Weihao Xie ⋅ Yingshui Tan ⋅ Yanghai Wang ⋅ Qianqian Xie ⋅ Zhaoxiang Zhang ⋅ JIAHENG LIU

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlook instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of 26 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.

Poster

P4-#4808

Towards Multimodal Data-Driven Scientific Discovery Powered by LLM Agents

Fan Liu ⋅ Xiaozhao Zeng ⋅ Hao Liu

Recent advances in large language models (LLMs) have enabled agents that automate scientific discovery by interpreting data, generating analysis pipelines, and executing them with computational tools. However, existing benchmarks remain largely limited to unimodal datasets and slice-level tasks, overlooking the fact that real discovery requires multimodal integration, modeling, and hypothesis-driven reasoning. To address this gap, we introduce MoSciBench, the first benchmark for multimodal scientific discovery, constructed from peer-reviewed studies through a principled four-stage pipeline. MoSciBench spans six scientific domains, seven data modalities, and five categories of discovery questions, yielding 88 individual, end-to-end, data-driven tasks. Each task is designed as a cross-modal hypothesis verification workflow, requiring agents to align and integrate heterogeneous datasets before modeling and reasoning. We further evaluate four representative agent frameworks across multiple LLM families. Results show that multimodal discovery is substantially harder than unimodal tasks: even the strongest agents achieve only 48.94\% accuracy, with over 60\% of failures due to cross-modal alignment. Lightweight workflow scaffolding consistently improves performance, reducing alignment errors by 5–10\% and raising accuracy by 5.7\% on average. Our benchmark and evaluation framework thus establish a rigorous testbed for advancing LLM agents toward realistic, multimodal scientific discovery.

Poster

P4-#4809

House Of Dextra : Cross-Embodied Co-Design for Dexterous Hands

Kehlani Fay ⋅ Darin Djapri ⋅ Anya Zorin ⋅ James Clinton ⋅ Ali El Lahib ⋅ Hao Su ⋅ Michael Tolley ⋅ Sha Yi ⋅ Xiaolong Wang

Dexterous manipulation is limited by both control and design, without consensus as to what makes manipulators best for performing dexterous tasks. This raises a fundamental challenge: how should we design and control robot manipulators that are optimized for dexterity? We present a co-design framework that learns task-specific hand morphology and complementary dexterous control policies. The framework supports 1) an expansive morphology search space including joint, finger, and palm generation, 2) scalable evaluation across the wide design space via morphology-conditioned cross-embodied control, and 3) real-world fabrication with accessible components. We evaluate the approach across multiple dexterous tasks, including in-hand rotation with simulation and real deployment. Our framework enables an end-to-end pipeline that can design, train, fabricate, and deploy a new robotic hand in under 24 hours. The full framework is open-sourced and available on our website.

Poster

P4-#4810

Graph Tokenization for Bridging Graphs and Transformers

Zeyuan Guo ⋅ Enmao Diao ⋅ Cheng Yang ⋅ Chuan Shi

The success of large pretrained Transformers is closely tied to tokenizers, which convert raw input into discrete symbols. Extending these models to graph-structured data remains a significant challenge. In this work, we introduce a graph tokenization framework that generates sequential representations of graphs by combining reversible graph serialization, which preserves graph information, with Byte Pair Encoding (BPE), a widely adopted tokenizer in large language models (LLMs). To better capture structural information, the graph serialization process is guided by global statistics of graph substructures, ensuring that frequently occurring substructures appear more often in the sequence and can be merged by BPE into meaningful tokens. Empirical results demonstrate that the proposed tokenizer enables Transformers such as BERT to be directly applied to graph benchmarks without architectural modifications. The proposed approach achieves state-of-the-art results on 14 benchmark datasets and frequently outperforms both graph neural networks and specialized graph transformers. This work bridges the gap between graph-structured data and the ecosystem of sequence models. Our code is available at \href{ https://github.com/BUPT-GAMMA/Graph-Tokenization-for-Bridging-Graphs-and-Transformers }{\color{blue}here}.

Poster

P4-#4811

PreferThinker: Reasoning-based Personalized Image Preference Assessment

Shengqi Xu ⋅ Xinpeng Zhou ⋅ Yabo Zhang ⋅ Ming Liu ⋅ Tao Liang ⋅ Tianyu Zhang ⋅ Yalong Bai ⋅ Zuxuan Wu ⋅ Wangmeng Zuo

Personalized image preference assessment aims to evaluate an individual user's image preferences by relying only on a small set of reference images as prior information. Existing methods mainly focus on general preference assessment, training models with large-scale data to tackle well-defined tasks such as text-image alignment. However, these approaches struggle to handle personalized preference because user-specific data are scarce and not easily scalable, and individual tastes are often diverse and complex. To overcome these challenges, we introduce a common preference profile that serves as a bridge across users, allowing large-scale user data to be leveraged for training profile prediction and capturing complex personalized preferences. Building on this idea, we propose a reasoning-based personalized image preference assessment framework that follows a \textit{predict-then-assess} paradigm: it first predicts a user's preference profile from reference images, and then provides interpretable, multi-dimensional scores and assessments of candidate images based on the predicted profile. To support this, we first construct a large-scale Chain-of-Thought (CoT)-style personalized assessment dataset annotated with diverse user preference profiles and high-quality CoT-style reasoning, enabling explicit supervision of structured reasoning. Next, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase to empower the model with structured reasoning capabilities, followed by reinforcement learning to incentivize the model to explore more reasonable assessment paths and enhance generalization. Furthermore, we propose a similarity-aware prediction reward to encourage better prediction of the user's preference profile, which facilitates more reasonable assessments exploration. Extensive experiments demonstrate the superiority of the proposed method. Our code and dataset will be publicly released.

Poster

P4-#4812

Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

Yifu Yuan ⋅ Haiqin Cui ⋅ Yaoting Huang ⋅ Yibin Chen ⋅ Fei Ni ⋅ Zibin Dong ⋅ Pengyi Li ⋅ YAN ZHENG ⋅ Hongyao Tang ⋅ Jianye Hao

Generalization in embodied AI is hindered by the "seeing-to-doing gap", stemming from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. Then we train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.

Poster

P4-#4813

CoCoDiff: Correspondence-Consistent Diffusion Model for Fine-grained Style Transfer

Wenbo Nie ⋅ Zixiang Li ⋅ Renshuai Tao ⋅ Bin WU ⋅ Yunchao Wei ⋅ Yao Zhao

Transferring visual style between images while preserving semantic correspondence between similar objects remains a central challenge in computer vision. While existing methods have made great strides, most of them operate at global level but overlook region-wise and even pixel-wise semantic correspondence. To address this, we propose CoCoDiff, a novel training-free and low-cost style transfer framework that leverages pretrained latent diffusion models to achieve fine-grained, semantically consistent stylization. We identify that correspondence cues within generative diffusion models are under-explored and that content consistency across semantically matched regions is often neglected. CoCoDiff introduces a pixel-wise semantic correspondence module that mines intermediate diffusion features to construct a dense alignment map between content and style images. Furthermore, a cycle-consistency module then enforces structural and perceptual alignment across iterations, yielding object and region level stylization that preserves geometry and detail. Despite requiring no additional training or supervision, CoCoDiff delivers state-of-the-art visual quality and strong quantitative results, outperforming methods that rely on extra training or annotations.

Poster

P4-#4814

CPQS-Tuning: A Model Self-Perception-Based Data Filtering Algorithm for Efficient Instruction Fine-Tuning

YI REN ⋅ Yanhui Li ⋅ Tianyi Zhang ⋅ Diandong Liu

Instruction fine-tuning is a key technique for enhancing the performance of large language models (LLMs), but low-quality and redundant data often hinder its effectiveness. Recent studies suggest that filtering a small amount of high-quality data for instruction fine-tuning can achieve faster and more efficient training performance. However, existing data filtering approaches predominantly depend on predefined evaluation models or manually designed metrics, without leveraging information from the target LLM itself. This limitation may result in a mismatch between the filtering criteria and the actual requirements of the LLM being fine-tuned, thereby reducing the effectiveness of the fine-tuning process. To address these issues, we propose a novel perspective: the hidden states of LLMs implicitly reflect the quality of the training data. Based on this insight, we propose a novel data filtering method that extracts the hidden states that reflect the target LLM’s perception of the data as representative features, and builds a data classification model upon them, which outputs the Contrastive Perception Quality Score (CPQS) for dataset filtering. Our experiments are conducted in both general and downstream domains. (1) In the general domain, our experiments show that training on under 10\% of the data from both the Alpaca_GPT4 and DeepSeek-R1 synthesized reasoning datasets enables our method to outperform models trained on the complete datasets. Moreover, it surpasses the performance of current state-of-the-art data-selection techniques. (2) In downstream tasks, our approach delivers an average performance gain exceeding 3.6\% over leading data-selection algorithms across multiple benchmarks, including GSM8K, HumanEval, and HumanEval-Plus.

Poster

P4-#4815

Physics-Informed Inference Time Scaling for Solving High-Dimensional Partial Differential Equations

Zexi Fan ⋅ Yan Sun ⋅ Shihao Yang ⋅ Yiping Lu

Solving high-dimensional partial differential equations (PDEs) is a critical challenge where modern data-driven solvers often lack reliability and rigorous error guarantees. We introduce Simulation-Calibrated Scientific Machine Learning (SCaSML), a framework that systematically improves pre-trained PDE solvers at inference time without any retraining. Our core idea is to derive a new PDE, which we term the Law of Defect, that precisely governs the error of a given surrogate model. Because this defect PDE retains the structure of the original problem, we can solve it efficiently with traditional stochastic simulators, yielding a targeted correction to the initial machine-learned solution. We prove that SCaSML achieves a faster convergence rate, with a final error bounded by the product of the surrogate and simulation errors. On challenging PDEs up to 160 dimensions, SCaSML reduces the error of various surrogate models, including PINNs and Gaussian Processes, by 20-80%. SCaSML provides a principled method to fuse the speed of machine learning with the rigor of numerical simulation, enhancing the trustworthiness of Al for scientific discovery.

Poster

P4-#4816

Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test

Ziyue Li ⋅ Chenrui Fan ⋅ Tianyi Zhou

This paper presents the first study of grokking in practical LLM pretraining. Specifically, we investigate when an LLM memorizes the training data, when its generalization on downstream tasks starts to improve, and what happens if there is a lag between the two. Unlike existing works studying when a small model generalizes to limited and specified tasks during thousands epochs' training on algorithmic data, we focus on a practical setting for LLMs, i.e., near single-pass pretraining of next-token prediction on a cross-domain, large-scale corpus, and generalization on diverse benchmark tasks covering math/commonsense reasoning, code generation, and domain-specific retrieval. Our study, for the first time, verifies that grokking still emerges in pretraining mixture-of-experts (MoE) LLMs, though different local data groups may enter their grokking stages asynchronously due to the heterogeneity of their distributions and attributions to others. To find a mechanistic interpretation of this local grokking, we investigate the dynamics of training data's pathways (i.e., expert choices across layers in MoE). Our primary discovery is that the pathways evolve from random, non-smooth across layers, instance-specific to more structured and transferable across samples, despite the converged pretraining loss. This depicts a transition from memorization to generalization. Two novel metrics are developed to quantify these patterns: one computes the pathway similarity between samples, while the other measures the consistency of aggregated experts between subsequent layers for each sample. These training data based metrics induce near zero cost but can faithfully track and monitor the generalization of LLMs on downstream tasks, reducing reliance on costly instruction tuning and benchmark evaluations. We also ground our findings in a theoretical analysis of one-layer MoE, showing that more structured pathways improve the generalization bound.

Poster

P4-#4817

StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models

Chenyu Zhou ⋅ Tianyi Xu ⋅ Jianghao Lin ⋅ Dongdong Ge

Large Language Models (LLMs) have shown promising capabilities for solving Operations Research (OR) problems. While reinforcement learning serves as a powerful paradigm for LLM training on OR problems, existing works generally face two key limitations. First, outcome reward suffers from the $\textit{credit assignment problem}$, where correct final answers can reinforce flawed reasoning. Second, conventional discriminative process supervision is $\textit{myopic}$, failing to evaluate the interdependent steps of OR modeling holistically. To this end, we introduce $\textbf{\texttt{StepORLM}}$, a novel self-evolving framework with generative process supervision. At its core, $\texttt{StepORLM}$ features a co-evolutionary loop where a policy model and a generative process reward model (GenPRM) iteratively improve on each other. This loop is driven by a dual-feedback mechanism: definitive, outcome-based verification from an external solver, and nuanced, holistic process evaluation from the GenPRM. The combined signal is used to align the policy via Weighted Direct Preference Optimization (W-DPO) and simultaneously refine the GenPRM. Our resulting 8B-parameter $\texttt{StepORLM}$ establishes a new state-of-the-art across six benchmarks, significantly outperforming vastly larger generalist models, agentic methods, and specialized baselines. Moreover, the co-evolved GenPRM is able to act as a powerful and universally applicable process verifier, substantially boosting the inference scaling performance of both our own model and other existing LLMs. We release our models and code to facilitate future research (https://github.com/0xzhouchenyu/StepORLM).

Poster

P4-#4818

Topological Anomaly Quantification for Semi-supervised Graph Anomaly Detection

Ting Guo ⋅ Yangrui Fan ⋅ Caixia Cui ⋅ Jiye Liang ⋅ Jiao Zhao ⋅ Da Wang

Semi-supervised graph anomaly detection identifies nodes deviating from normal patterns using a limited set of labeled nodes. This paper specifically addresses the challenging scenario where only normal node labels are available. To address the challenge of anomaly scarcity in real-world graphs, generative-based methods synthesize anomalies by linear/non-linear interpolation or random noise perturbation. However, these methods lack a quantitative assessment of anomalies, hindering the reliability of the generated ones. To overcome this limitation, we propose a generative graph anomaly detection model based on topological anomaly quantification (TAQ-GAD). First, we design a topological anomaly quantification module (TAQ), which quantifies node abnormality through two topological metrics: The node boundary score (NBS) quantifies the boundaryness of a node by evaluating its connectivity to labeled normal neighbors. The node isolation score (NIS) assesses the structural isolation of a node by evaluating its connection strength to other nodes within the same category. This anomaly measurement module dynamically screens nodes with high anomaly scores as pseudo-anomaly nodes. Subsequently, the topological anomaly enhancement (TAE) module generates virtual anomaly center nodes and constructs their topological relationships with other nodes. Finally, the method integrates normal and pseudo-anomaly nodes on the enhanced graph for model training. Extensive experiments on benchmark datasets demonstrate TAQ-GAD’s superiority over state-of-the-art methods and effectively improve anomaly detection performance.

Poster

P4-#4918

GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models

Xingyu Zhu ⋅ Beier Zhu ⋅ Junfeng Fang ⋅ Shuo Wang ⋅ Yin Zhang ⋅ Xiang Wang ⋅ Xiangnan He

Large vision-language models (LVLMs) have achieved remarkable progress in vision–language reasoning tasks, yet ensuring their safety remains a critical challenge. Recent input-side defenses detect unsafe images with CLIP and prepend safety prefixes to prompts, but they still suffer from inaccurate detection in complex scenes and unstable safety signals during decoding. To address these issues, we propose GuardAlign, a training-free defense framework that integrates two strategies. First, OT-enhanced safety detection leverages optimal transport to measure distribution distances between image patches and unsafe semantics, enabling accurate identification of malicious regions without additional computational cost. Second, cross-modal attentive calibration strengthens the influence of safety prefixes by adaptively reallocating attention across layers, ensuring that safety signals remain consistently activated throughout generation. Extensive evaluations on six representative MLLMs demonstrate that GuardAlign reduces unsafe response rates by up to 39\% on SPA-VL, while preserving utility, achieving an improvement on VQAv2 from 78.51\% to 79.21\%.

Poster

P4-#4917

Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs

Feng Hong ⋅ Geng Yu ⋅ Yushi Ye ⋅ Haicheng Huang ⋅ Huangjie Zheng ⋅ Ya Zhang ⋅ Yanfeng Wang ⋅ Jiangchao Yao

Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to Autoregressive models, designed for fast parallel generation. However, existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We attribute this to the irreversibility of standard decoding in DLLMs, which is easily polarized into the wrong decoding direction along with early error context accumulation. To resolve this, we introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs. WINO employs a parallel draft-and-verify mechanism, aggressively drafting multiple tokens while simultaneously using the model's bidirectional context to verify and re-mask suspicious ones for refinement. Verified in open-source DLLMs like LLaDA and MMaDA, WINO is shown to decisively improve the quality-speed trade-off. For instance, on the GSM8K math benchmark, it accelerates inference by 6 while improving accuracy by 2.58%; on Flickr30K captioning, it achieves a 10 speedup with higher performance. More comprehensive experiments are conducted to demonstrate the superiority and provide an in-depth understanding of WINO.

Poster

P4-#4916

Quasi-Equivariant Metanetworks

Viet-Hoang Tran ⋅ An Nguyen ⋅ Benoît Guérand ⋅ Thieu Vo ⋅ Tan Nguyen

Metanetworks are neural architectures designed to operate directly on pretrained weights to perform downstream tasks. However, the parameter space serves only as a proxy for the underlying function class, and the parameter-function mapping is inherently non-injective: distinct parameter configurations may yield identical input-output behaviors. As a result, metanetworks that rely solely on raw parameters risk overlooking the intrinsic symmetries of the architecture. Reasoning about functional identity is therefore essential for effective metanetwork design, motivating the development of equivariant metanetworks, which incorporate equivariance principles to respect architectural symmetries. Existing approaches, however, typically enforce strict equivariance, which imposes rigid constraints and often leads to sparse and less expressive models. To address this limitation, we introduce the novel concept of quasi-equivariance, which allows metanetworks to move beyond the rigidity of strict equivariance while still preserving functional identity. We lay down a principled basis for this framework and demonstrate its broad applicability across diverse neural architectures, including feedforward, convolutional, and transformer networks. Through empirical evaluation, we show that quasi-equivariant metanetworks achieve good trade-offs between symmetry preservation and representational expressivity. These findings advance the theoretical understanding of weight-space learning and provide a principled foundation for the design of more expressive and functionally robust metanetworks.

Poster

P4-#4915

Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment

Hongbin Zhang ⋅ Kehai Chen ⋅ Xuefeng Bai ⋅ Yang Xiang ⋅ Min Zhang

Reward models (RMs) are crucial for aligning large language models (LLMs) with diverse cultures. Consequently, evaluating their cultural awareness is essential for further advancing global alignment of LLMs. However, existing RM evaluations fall short in assessing cultural awareness due to the scarcity of culturally relevant evaluation datasets. To fill this gap, we propose Cultural Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures across 4 cultural domains. Our extensive evaluation of state-of-the-art RMs reveals their deficiencies in modeling cultural awareness and demonstrates a positive correlation between performance on CARB and downstream multilingual cultural alignment tasks. Further analysis identifies the spurious correlations within culture-aware reward modeling, wherein RM's scoring relies predominantly on surface-level features rather than authentic cultural nuance understanding. To address these, we propose Think-as-Locals to elicit deeper culturally grounded reasoning from generative RMs via reinforcement learning from verifiable rewards (RLVR) and employ well-designed rewards to ensure accurate preference judgments and high-quality structured evaluation criteria generation. Experimental results validate its efficacy in mitigating spurious features interference and advancing culture-aware reward modeling.

Poster

P4-#4914

Helmsman: Autonomous Synthesis of Federated Learning Systems via Collaborative LLM Agents

Haoyuan Li ⋅ Mathias Funk ⋅ Aaqib Saeed

Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel LLM-based multi-agent framework that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised generative agent teams, and (3) a closed-loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of LLM-driven agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems. Code is available at: https://github.com/haoyuan-l/Helmsman.

Poster

P4-#4913

GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

Mateusz Michalkiewicz ⋅ Anekha Sokhal ⋅ Tadeusz Michalkiewicz ⋅ Piotr Pawlikowski ⋅ Mahsa Baktashmotlagh ⋅ Varun Jampani ⋅ Guha Balakrishnan

Monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet their true understanding of geometric properties remains unclear. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images and corresponding 3D meshes of diverse polyhedra—including Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and compound shapes—covering varying levels of complexity and symmetry. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric forms accurately. While foundation models effectively detect specific 3D symmetry elements via non-linear probing, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants exhibit remarkably low accuracy on complex polyhedra, systematically misinterpreting basic properties like face geometry, convexity, and compound structures. GIQ is publicly available at toomanymatts.github.io/giq-benchmark/, providing a structured platform to benchmark critical gaps in geometric intelligence and facilitate future progress in robust, geometry-aware representation learning.

Poster

P4-#4912

Learning to Orchestrate Agents in Natural Language with the Conductor

Stefan Nielsen ⋅ Edoardo Cetin ⋅ Peter Schwendeman ⋅ Qi Sun ⋅ Jinglue Xu ⋅ Yujin Tang

Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.

Poster

P4-#4911

Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation

Dian XIE ⋅ Shitong Shao ⋅ Lichen Bai ⋅ zikai zhou ⋅ Bojun Cheng ⋅ Shuo Yang ⋅ JUN WU ⋅ Zeke Xie

Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall that common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design Transcendent Diffusion Guidance (TDG) method that can significantly improve human preference scores in the conventional evaluation framework but actually does not work in practice. Fourth, in extensive experiments, we empirically evaluate recent eight diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scales can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work would strongly motivate the community to rethink the evaluation paradigm and future directions of this field.

Poster

P4-#4910

Data-to-Energy Stochastic Dynamics

Kirill Tamogashev ⋅ Nikolay Malkin

The Schrödinger bridge problem is concerned with finding a stochastic dynamical system bridging two marginal distributions that minimises a certain transportation cost. This problem, which represents a generalisation of optimal transport to the stochastic case, has received attention due to its connections to diffusion models and flow matching, as well as its applications in the natural sciences. However, all existing algorithms enable the inference of such dynamics only for cases where samples from both distributions are available. In this paper, we propose the first general method for modelling Schrödinger bridges when one (or both) distributions are given by their unnormalised densities, with no access to data samples. Our algorithm relies on a generalisation of the iterative proportional fitting (IPF) procedure to the data-free case, inspired by recent developments in off-policy reinforcement learning for training of diffusion samplers. We demonstrate the efficacy of the proposed data-to-energy IPF on synthetic problems, finding that it can successfully learn transports between multimodal distributions. As a secondary consequence of our reinforcement learning formulation, which assumes a fixed time discretisation scheme for the dynamics, we find that existing data-to-data Schrödinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics. Finally, we apply the newly developed algorithm to the problem of sampling posterior distributions in latent spaces of generative models, thus creating a data-free image-to-image translation method.

Poster

P4-#4909

Score Distillation Beyond Acceleration: Generative Modeling from Corrupted Data

Yasi Zhang ⋅ Tianyu Chen ⋅ Zhendong Wang ⋅ Yingnian Wu ⋅ Mingyuan Zhou ⋅ Oscar Leong

Learning generative models directly from corrupted observations is a long-standing challenge across natural and scientific domains. We introduce **Restoration Score Distillation (RSD)**, a unified framework for learning high-fidelity, one-step generative models using **only** degraded data of the form $ y = \mathcal{A}(x) + \sigma \varepsilon, x\sim p_X,\ \varepsilon\sim \mathcal{N}(0,I_m), $ where the mapping $\mathcal{A}$ may be the identity or a non-invertible corruption operator (e.g., blur, masking, subsampling, Fourier acquisition). RSD first pretrains a corruption-aware diffusion teacher on the observed measurements, then *distills* it into an efficient one-step generator whose samples are statistically closer to the clean distribution $p_X$. The framework subsumes identity corruption (denoising task) as a special case of our general formulation. Empirically, RSD consistently reduces Fr\'echet Inception Distance (FID) relative to corruption-aware diffusion teachers across noisy generation (CIFAR-10, FFHQ, CelebA-HQ, AFHQ-v2), image restoration (Gaussian deblurring, random inpainting, super-resolution, and mixtures with additive noise), and multi-coil MRI—*without access to any clean images*. The distilled generator inherits one-step sampling efficiency, yielding up to $30\times$ speedups over multi-step diffusion while surpassing the teachers after substantially fewer training iterations. These results establish score distillation as a practical tool for generative modeling from corrupted data, *not merely for acceleration*. We provide theoretical support for the use of distillation in enhancing generation quality in the Appendix. The code is available at https://github.com/TianyuCodings/RSD.

Poster

P4-#4908

Anchored Supervised Fine-Tuning

He Zhu ⋅ Junyou Su ⋅ Peng Lai ⋅ Ren Ma ⋅ Wenjia Zhang ⋅ Linyi Yang ⋅ Guanhua Chen

Post-training of large language models involves a fundamental trade-off between supervised fine-tuning (SFT), which efficiently mimics demonstrations but tends to memorize, and reinforcement learning (RL), which achieves better generaliza- tion at higher computational cost. Dynamic Fine-Tuning (DFT) recently emerged as a promising middle ground, reweighting SFT objectives with token probabili- ties and achieving improvements in certain reasoning domains, though it exhibits instability in other tasks. We provide a analysis of DFT through the reward- weighted regression (RWR) framework, revealing that it corresponds to a spe- cific auxiliary distribution choice that yields provably tighter RL bounds than standard SFT. However, our analysis also uncovers a critical limitation: this con- struction lacks distributional anchoring, leading to progressive drift that under- mines training stability. To address this, we propose Anchored Supervised Fine- Tuning (ASFT), which augments DFT’s reweighting with lightweight KL regu- larization to preserve tightness while ensuring stability. Empirically, ASFT con- sistently outperforms both SFT and DFT across mathematical reasoning, medical knowledge grounding, and code generation, achieving substantial improvements with minimal computational overhead. Our RWR framework provides a system- atic lens for understanding post-training methods and demonstrates that principled theoretical analysis leads to both stronger guarantees and practical gains.

Poster

P4-#4907

UrbanFeel：A Comprehensive Benchmark for Temporal and Perceptual Understanding of City Scenes through Human Perspective

Jun He ⋅ Yi Lin ⋅ Zilong Huang ⋅ Jiacong Yin ⋅ Junyan Ye ⋅ Yuchuan Zhou ⋅ Weijia Li ⋅ Xiang Zhang

Urban development impacts over half of the global population, making human-centered understanding of its structural and perceptual changes essential for smart city planning. While Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various domains, existing benchmarks that explore their performance in urban environments remain limited, lacking systematic exploration of temporal evolution and subjective perception of urban environment that aligns with human perception. To address these limitations, we propose UrbanFeel, a comprehensive benchmark designed to evaluate the performance of MLLMs in urban development understanding and subjective environmental perception. UrbanFeel comprises 14.3K carefully constructed visual questions spanning three cognitively progressive dimensions: Static Scene Perception, Temporal Change Perception, and Subjective Environmental Perception. We collect multi-temporal single-view and panoramic street-view images from 11 representative cities worldwide, and generate high-quality question-answer pairs through a hybrid pipeline of spatial clustering, rule-based generation, model-assisted prompting, and manual annotation. Through extensive evaluation of 20 state-of-the-art MLLMs, we observe that Gemini-2.5 Pro achieves the best overall performance, with its accuracy approaching human expert levels and narrowing the average gap to just 1.5%. Most models perform well on tasks grounded in scene understanding. In particular, some models even surpass human annotators in pixel-level change detection. However, performance drops notably in tasks requiring temporal reasoning over urban development. Additionally, in the subjective perception dimension, several models reach human-level or even higher consistency in evaluating dimension such as beautiful and safety. Our results suggest that MLLMs are demonstrating rudimentary emotion understanding capabilities. The code and dataset of this work will be released at https://github.com/Hejun0915/UrbanFeel .

Poster

P4-#4906

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Xiangyu Zhao ⋅ Lin ⋅ Tianhao Liang ⋅ Yifan Zhou ⋅ Wenhao Chai ⋅ Yuzhe Gu ⋅ Weiyun Wang ⋅ Kai Chen ⋅ Gen Luo ⋅ Junchi Yan ⋅ Wenwei Zhang ⋅ Hua Yang ⋅ Haodong Duan ⋅ Xue Yang

While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6\% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7\% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.

Poster

P4-#4905

PrefDisco: Benchmarking Proactive Personalized Reasoning

Stella Li ⋅ Avinandan Bose ⋅ Faeze Brahman ⋅ Simon Du ⋅ Pang Wei Koh ⋅ Maryam Fazel ⋅ Yulia Tsvetkov

Current large language model (LLM) development treats task-solving and preference-alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user’s needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to proactively identify what they don’t know about the user, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly—a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PREFALIGN as a fine-grained rubric-based metric for measuring preference alignment. PREFDISCO builds scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO provides a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.

Poster

P4-#4904

Arbitrary-Order Block SignSGD for Memory-Efficient LLM Fine-Tuning

Yijie Zhou ⋅ Shi Pu

We propose \textbf{ABSignSGD}, a block‑coordinate variant of sign-based descent with flexible block selection that enables memory‑ and runtime‑efficient full‑parameter fine‑tuning of large language models. We present a unified convergence analysis under mild conditions, covering both the base method and a \textit{majority‑vote} extension for distributed training. The latter improves communication efficiency by aggregating only gradient signs rather than averaging full gradients. Experiments on \textcolor{blue}{Qwen3‑8B, Llama3-8B, and Qwen3-32B}, spanning mathematical reasoning and general instruction‑following tasks, show that ABSignSGD converges faster per iteration and delivers superior downstream performance while reducing both runtime and memory usage compared to existing methods. Ablation studies further indicate that the memoryless sign-based update naturally complements block‑wise updates, explaining the method’s strong empirical performance.

Poster

P4-#4903

Multi-Subspace Multi-Modal Modeling for Diffusion Models: Estimation, Convergence and Mixture of Experts

Ruofeng Yang ⋅ Yongcan Li ⋅ Bo Jiang ⋅ Cheng Chen ⋅ Shuai Li

Recent diffusion models demonstrate remarkable sample efficiency and fast optimization, contradicting standard estimation bounds that suffer from the curse of dimensionality $n^{-1/D}$ with the data dimension $D$. Since images are usually a union of low-dimensional manifolds, current works model the data as a union of linear subspaces with Gaussian latent and achieve a $1/\sqrt{n}$ bound. Though this modeling reflects the multi-manifold property, the Gaussian latent can not capture the multi-modal property of the latent manifold. To bridge this gap, we propose the mixture subspace of low-rank mixture of Gaussian (MoLR-MoG) modeling, which models the target data as a union of $K$ linear subspaces, and each subspace admits a mixture of Gaussian latent ($n_k$ modals with dimension $d_k$). With this modeling, the corresponding score function naturally has a mixture of expert (MoE) structure, captures the multi-modal information, and contains nonlinear property. Empirically, our MoE-latent MoG network significantly outperforms MoLRG Gaussian baselines and matches MoE-latent U-Net performance with $10\times$ fewer parameters, validating its practical suitability. Theoretically, we provide provable convergence guarantees for the optimization process and establish an estimation error bound of $R^4\sqrt{\sum_{k=1}^K n_k}\sqrt{\sum_{k=1}^K n_k d_k}/\sqrt{n}$, successfully escaping the dimensionality curse. Collectively, with MoLR-MoG modeling, this work explains why diffusion models only require a small training sample and enjoy a fast optimization process. Furthermore, we also show the potential of MoE structure for diffusion models from the manifold perspective.

Poster

P4-#4901

InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

Haotian Ye ⋅ Qiyuan He ⋅ Jiaqi Han ⋅ Puheng Li ⋅ Jiaojiao Fan ⋅ Zekun Hao ⋅ Fitsum Reda ⋅ Yogesh Balaji ⋅ Huayu Chen ⋅ Sheng Liu ⋅ Angela Yao ⋅ James Y Zou ⋅ Stefano Ermon ⋅ Haoxiang Wang ⋅ Ming-Yu Liu

Accurate and efficient discrete video tokenization is essential for long video sequences processing. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces \alg, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving $20\%$ tokens without influence on performance, and achieving $2.3\times$ compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, \alg enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.

Poster

P4-#5001

WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

Kuan Li ⋅ Zhongwang Zhang ⋅ Huifeng Yin ⋅ Rui Ye ⋅ Yida Zhao ⋅ Liwen Zhang ⋅ Litu Ou ⋅ Ding-Chu Zhang ⋅ Xixi Wu ⋅ Xinmiao Yu ⋅ Jialong Wu ⋅ Xinyu Wang ⋅ Zile Qiao ⋅ Zhen Zhang ⋅ Yong Jiang ⋅ Pengjun Xie ⋅ Fei Huang ⋅ Zhiqin Xu ⋅ Shuai Wang ⋅ Minhao Cheng ⋅ Jingren Zhou

To significantly advance the capabilities of open-source web agents, we present WebSailor-V2, a complete post-training pipeline encompassing data construction, Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL). Our methodology features two key innovations: (1) On the data front, we developed SailorFog-QA-2, a novel dataset built from a densely interconnected knowledge graph that introduces a wide variety of uncertainties beyond simple obfuscation, fostering more sophisticated reasoning. (2) For training, we engineered a dual-environment RL framework, combining a high-fidelity simulator for rapid, low-cost algorithmic iteration with a robust, managed real-world environment for stable final policy training, all integrated within a symbiotic data-policy feedback loop. Trained on the Qwen3-30B-A3B model, WebSailor-V2 achieves state-of-the-art results, scoring 35.3 on BrowseComp-EN, 44.1 on BrowseComp-ZH, and 30.6 on Humanity's Last Exam (HLE). Notably, our 30B-A3B MOE agent significantly outperforms all existing open-source agents and surpasses even the 671B DeepSeek-V3.1, demonstrating performance competitive with leading proprietary systems.

Poster

P4-#5002

TUMIX: Multi-Agent Test-Time Scaling with Tool-Use Mixture

Yongchao Chen ⋅ Jiefeng Chen ⋅ Rui Meng ⋅ Ji Yin ⋅ Na Li ⋅ Chuchu Fan ⋅ Chi Wang ⋅ Tomas Pfister ⋅ Jinsung Yoon

While integrating tools like Code Interpreter and Search has significantly enhanced Large Language Model (LLM) reasoning in models like ChatGPT Agent and Gemini-Pro, practical guidance on optimal tool use is lacking. The core challenge is effectively combining textual reasoning, coding, and search for diverse questions. In this paper, we propose Tool-Use Mixture (TUMIX), an ensemble framework that runs multiple agents in parallel, each employing distinct tool-use strategies and answer paths. Agents in TUMIX iteratively share and refine responses based on the question and previous answers. In experiments, TUMIX achieves significant gains over state-of-the-art tool-augmented and test-time scaling methods, delivering an average accuracy improvement of up to 3.55\% over the best baseline on Gemini-2.5-Pro and Gemini-2.5-Flash across key reasoning benchmarks, with near-equal inference costs. We find that agent diversity and quality are crucial and can be enhanced by using LLMs to auto-optimize agent designs. Furthermore, TUMIX can halt refinement upon reaching sufficient confidence, preserving performance at only 49\% of the inference cost. Further scaling can achieve higher performance, albeit at a greater cost.

Poster

P4-#5003

A General Framework for Black-Box Attacks Under Cost Asymmetry

Mahdi Salmani ⋅ Alireza Abdollahpoorrostam ⋅ Seyed-Mohsen Moosavi-Dezfooli

Traditional decision-based black-box adversarial attacks on image classifiers aim to generate adversarial examples by slightly modifying input images while keeping the number of queries low, where each query involves sending an input to the model and observing its output. Most existing methods assume that all queries have equal cost. However, in practice, queries may incur asymmetric costs; for example, in content moderation systems, certain output classes may trigger additional review, enforcement, or penalties, making them more costly than others. While prior work has considered such asymmetric cost settings, effective algorithms for this scenario remain underdeveloped. In this paper, we introduce asymmetric black-box attacks, a new family of decision-based attacks that generalize to the asymmetric query-cost setup. We develop new methods for boundary search and gradient estimation when crafting adversarial examples. Specifically, we propose Asymmetric Search (AS), a more conservative alternative to binary search that reduces reliance on high-cost queries, and Asymmetric Gradient Estimation (AGREST), which shifts the sampling distribution in Monte Carlo style gradient estimation to favor low-cost queries. We design efficient algorithms that reduce total attack cost by balancing different query types, in contrast to earlier methods such as stealthy attacks that focus only on limiting expensive (high-cost) queries. We perform both theoretical analysis and empirical evaluation on standard image classification benchmarks. Across various cost regimes, our method consistently achieves lower total query cost and smaller perturbations than existing approaches, reducing the perturbation norm by up to 40\% in some settings.

Poster

P4-#5004

Faithfulness Under the Distribution: A New Look at Attribution Evaluation

Zhiyu Zhu ⋅ Zhibo Jin ⋅ Jiayu Zhang ⋅ Bartlomiej Sobieski ⋅ Przemyslaw Biecek ⋅ Fang Chen ⋅ Jianlong Zhou

Evaluating the faithfulness of attribution methods remains an open challenge. Standard metrics such as Insertion and Deletion Scores rely on heuristic input perturbations (e.g., zeroing pixels), which often push samples out of the data distribution (OOD). This can distort model behavior and lead to unreliable evaluations. We propose FUD, a novel evaluation framework that reconstructs masked regions using score-based diffusion models to produce in-distribution, semantically coherent inputs. This distribution-aware approach avoids the common pitfalls of existing Attribution Evaluation Methods (AEMs) and yields assessments that more accurately reflect attribution faithfulness. Experiments across models show that FUD produces significantly different—and more reliable—judgments than prior approaches. Our implementation is available at: https://github.com/LMBTough/FUD.

Poster

P4-#5005

MARTI: A Framework for Multi-Agent LLM Systems Reinforced Training and Inference

Kaiyan Zhang ⋅ Kai Tian ⋅ Runze Liu ⋅ Sihang Zeng ⋅ Xuekai Zhu ⋅ Guoli Jia ⋅ Yuchen Fan ⋅ Xingtai Lv ⋅ Yuxin Zuo ⋅ Che Jiang ⋅ Yuru wang ⋅ Jianyu Wang ⋅ Ermo Hua ⋅ Xinwei Long ⋅ Junqi Gao ⋅ Youbang Sun ⋅ Zhiyuan Ma ⋅ Ganqu Cui ⋅ Ning Ding ⋅ Biqing Qi ⋅ Bowen Zhou

We present MARTI (Multi-Agent Reinforced Training and Inference), an open-source framework designed to facilitate scalable and efficient learning of multi-agent LLM systems. MARTI supports centralized multi-agent interactions and distributed policy training, with the added capability of multi-turn asynchronous rollouts to enhance training efficiency. The framework includes dynamic workflows for multi-agent interactions, which integrate both rule-based verifiable rewards and LLM-based generative rewards. We validate the effectiveness of MARTI through comprehensive experiments on diverse mathematical tasks, demonstrating that multi-agent LLM-based systems outperform single-agent systems within the same inference budget after convergence. Our contributions lay the foundation for exploring scalable collaborations within LLM-based multi-agent systems and advancing the capabilities of large reasoning models.

Poster

P4-#5006

Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models

Zemin Huang ⋅ Yuhang Wang ⋅ Zhiyang Chen ⋅ Guo-Jun Qi

Mask-based Diffusion Language Models (DLMs) struggle to revise incorrect tokens: once a token is generated, it typically remains fixed. The key challenge is to identify potential errors in the inputs. In this paper, we propose Remasking-enabled Diffusion Language Model (RemeDi), a mask-based DLM that introduces remasking as another fundamental mechanism, enabling more flexible text refinement in diffusion-based text generation. To achieve this, RemeDi jointly predicts token distributions and per-token confidence scores at each step. The confidence scores determine which tokens to be unmasked after the current step, allowing the model to identify tokens with low quality and remask them. These remasked tokens can be resampled with richer context in subsequent steps. We design a remask-aware pipeline to train this ability, including supervised fine-tuning which teaches the model to detect and remask incorrect tokens in addition to predict mask tokens, and reinforcement learning which optimizes full generation trajectories toward higher rewards. Experiments show that RemeDi achieves the state-of-the-art results among open-source DLMs on multiple datasets.

Poster

P4-#5007

Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models

Xuyang Liu ⋅ Xiyan Gui ⋅ Yuchao Zhang ⋅ Linfeng Zhang

Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains.

Poster

P4-#5008

Multi-Marginal Flow Matching with Adversarially Learnt Interpolants

Oskar Kviman ⋅ Kirill Tamogashev ⋅ Nicola Branchini ⋅ Víctor Elvira ⋅ Jens Lagergren ⋅ Nikolay Malkin

Learning the dynamics of a process given sampled observations at several time points is an important but difficult task in many scientific applications. When no ground-truth trajectories are available, but one has only snapshots of data taken at discrete time steps, the problem of modelling the dynamics, and thus inferring the underlying trajectories, can be solved by multi-marginal generalisations of flow matching algorithms. This paper introduces a novel flow matching method that overcomes the limitations of existing multi-marginal trajectory inference algorithms. Our proposed method, ALI-CFM, uses a GAN-inspired adversarial loss to fit neurally parameterised interpolant curves between source and target points such that the marginal distributions at intermediate time points are close to the observed distributions. The resulting interpolants are smooth trajectories that, as we show, are unique under mild assumptions. These interpolants are subsequently marginalised by a flow matching algorithm, yielding a trained vector field for the underlying dynamics. ALI-CFM outperforms existing baselines on spatial transcriptomics and cell tracking problems, while performing on par with them on single-cell trajectory prediction, which showcases its versatility and scalability.

Poster

P4-#5011

Asynchronous Matching with Dynamic Sampling for Multimodal Dataset Distillation

Ding Qi ⋅ Jian Li ⋅ Shuguang Dou ⋅ Zifan Song ⋅ Junyao Gao ⋅ Yabiao Wang ⋅ Chengjie Wang ⋅ Cai Zhao

Multimodal Dataset Distillation (MDD) has emerged as a vital paradigm for enabling efficient training of vision-language models (VLMs) in the era of multimodal data proliferation. Unlike traditional dataset distillation methods that focus on single-modal tasks, MDD presents distinct challenges: (i) the effective distillation of heterogeneous multimodal knowledge, complicated by feature space misalignment and asynchronous optimization dynamics; and (ii) the lack of discrete class guidance, which hinders the distribution coverage and representativeness of synthetic data due to the vastness and continuity of the semantic space. To address these challenges, this paper proposes an Asynchronous Matching with Dynamic sampling (AMD) framework. AMD enables asynchronous trajectory matching by decoupling the selection of starting points for image and text trajectories. Additionally, a Semantics-Aware Prototype Mining module is introduced, which replaces random initialization by leveraging feature-space clustering to identify representative prototypes, enhancing the coverage and representativeness of the distilled samples. Extensive experiments demonstrate that AMD achieves superior distillation performance on Flickr30k and COCO (e.g., IR@1, IR@5, and IR@10 \textbf{gains of 4.5\%, 9.6\%, and 10.9\%}, respectively, on Flickr30k 200 pairs.) with negligible computational overhead.

Poster

P4-#5012

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

Zhuo Li ⋅ Pengyu Cheng ⋅ Zhechao Yu ⋅ FeifeiTong ⋅ Anningzhe Gao ⋅ Tsung-Hui Chang ⋅ Xiang Wan ⋅ erchao.zec ⋅ xiaoxi jiang ⋅ guanjunjiang

Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually human-preferred but with more words, leading response length to become one of the inevitable inductive biases. A limited number of prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, e.g., Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called Debiasing via Information optimization for RM (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs, while minimizing the MI between RM outputs and biased attributes of preference inputs. With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations, broadly extending the real-world application scenarios for RM debiasing methods. In experiments, we verify the effectiveness of DIR with three types of inductive biases: response length, sycophancy, and format. We discover that DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities. The code and training recipes are available at https://github.com/Qwen-Applications/DIR.

Poster

P4-#5013

Optimizing Canaries for Privacy Auditing with Metagradient Descent

Matteo Boglioni ⋅ Terrance Liu ⋅ Andrew Ilyas ⋅ Steven Wu

In this work we study black-box privacy auditing, where the goal is to lower bound the privacy parameter of a differentially private learning algorithm using only the algorithm’s outputs (i.e., final trained model). For DP-SGD (the most successful method for training differentially private deep learning models), the canonical auditing approach uses membership inference—an auditor comes with a small set of special “canary” examples, inserts a random subset of them into the training set, and then tries to discern which of their canaries were included in the training set (typically via a membership inference attack). The auditor’s success rate then provides a lower bound on the privacy parameters of the learning algorithm. Our main contribution is a method for optimizing the auditor’s canary set to improve privacy auditing, leveraging recent work on metagradient optimization (Engstrom et al., 2025). Our empirical evaluation demonstrates that in certain instances, using such optimized canaries can improve empirical lower bounds for differentially private image classification models by several times when compared to canaries proposed in prior work. Furthermore, we demonstrate that our method is DP-SGD agnostic and efficient: canaries optimized for non-private SGD with a small model architecture remain effective when auditing larger models trained with DP-SGD.

Poster

P4-#5014

PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models

Soufiane Hayou ⋅ Nikhil Ghosh ⋅ Bin Yu

Low-Rank Adaptation is a widely used finetuning method for large models. Its small memory footprint allows practitioners to adapt large models to specific tasks at a fraction of the cost of full finetuning. Different modifications have been proposed to enhance its efficiency by, for example, setting the learning rate, the rank, and the initialization. Another improvement axis is adapter placement strategy: when using LoRA, practitioners usually pick \emph{module types} to adapt with LoRA, such as Query and Key modules. Few works have studied the problem of adapter placement, with nonconclusive results: original LoRA paper suggested placing adapters in attention modules, while other works suggested placing them in the MLP modules. Through an intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a lightweight method that allows automatic identification of module types where LoRA adapters should be placed, given a pretrained model and a finetuning task. We demonstrate that PLoP consistently outperforms, and in the worst case competes, with commonly used placement strategies through comprehensive experiments on supervised finetuning and reinforcement learning for reasoning.

Poster

P4-#5015

AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning

Yi Zhang ⋅ An Zhang ⋅ XiuYu Zhang ⋅ Leheng Sheng ⋅ Yuxin Chen ⋅ Zhenkai Liang ⋅ Xiang Wang

Large language models (LLMs), despite possessing latent safety understanding from their vast pretraining data, remain vulnerable to generating harmful content and exhibit issues such as over-refusal and utility degradation after safety alignment. Current safety alignment methods often result in superficial refusal shortcuts or rely on intensive supervision for reasoning-based approaches, failing to fully leverage the model's intrinsic safety self-awareness. We propose \textbf{AlphaAlign}, a simple yet effective pure reinforcement learning (RL) framework with verifiable safety reward designed to incentivize this latent safety awareness through proactive safety reasoning. AlphaAlign employs a dual-reward system: a verifiable safety reward encourages correctly formatted and explicitly justified refusals for harmful queries while penalizing over-refusals, and a normalized helpfulness reward guides high-quality responses to benign inputs. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data. AlphaAlign demonstrates three key advantages: (1) Simplicity and efficiency, requiring only binary prompt safety labels and minimal RL steps for substantial improvements. (2) Breaking the safety-utility trade-off, by enhancing refusal of harmful content and reducing over-refusals, while simultaneously maintaining or even improving general task performance and robustness to unseen jailbreaks. (3) Deep alignment, fostering proactive safety reasoning that generates explicit safety rationales rather than relying on shallow refusal patterns. Our codes are available at \url{https://github.com/zy20031230/AlphaAlign}

Poster

P4-#5016

MICLIP: Learning to Interpret Representation in Vision Models

Yingdong Shi ⋅ Zhiyu Yang ⋅ Changming Li ⋅ Jingyi Yu ⋅ Kan Ren

Vision models have demonstrated remarkable capabilities, yet their decision-making processes remain largely opaque. Mechanistic interpretability (MI) offers a promising avenue to decode these internal workings. However, existing interpretation methods suffer from two key limitations. First, they rely on the flawed activation-magnitude assumption, assuming that the importance of a neuron is directly reflected by the magnitude of its activation, which ignores more nuanced causal roles. Second, they are predominantly input-centric, failing to capture the causal mechanisms that drive a model's output. These shortcomings lead to inaccurate and unreliable internal representation interpretations, especially in cases of incorrect predictions. We propose MICLIP (Mechanism-Interpretability via Contrastive Learning), a novel framework that extends CLIP’s contrastive learning to align internal mechanisms of vision models with general semantic concepts, enabling interpretable and controllable representations. Our approach circumvents previous limitations by performing multimodal alignment between a model's internal representations and both its input concepts and output semantics via contrastive learning. We demonstrate that MICLIP is a general framework applicable to diverse representation unit types, including individual neurons and sparse autoencoder (SAE) features. By enabling precise, causal-aware interpretation, MICLIP not only reveals the semantic properties of a model's internals but also paves the way for effective and targeted manipulation of model behaviors.

Poster

P4-#5017

Fast-dLLM v2: Efficient Block-Diffusion LLM

Chengyue Wu ⋅ Hao Zhang ⋅ Shuchen Xue ⋅ Shizhe Diao ⋅ Yonggan Fu ⋅ Zhijian Liu ⋅ Pavlo Molchanov ⋅ Ping Luo ⋅ Song Han ⋅ Enze Xie

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2, a carefully designed block diffusion language model (dLLM) that efficiently adapts pretrained AR models into dLLMs for parallel text generation—requiring only ∼1B tokens of fine-tuning. This represents a 500× reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens), while preserving the original model’s performance. Our approach introduces a novel training recipe that combines a block diffusion mechanism with a complementary attention mask, enabling blockwise bidirectional context modeling without sacrificing AR training objectives. To further accelerate decoding, we design a hierarchical caching mechanism: a block-level cache that stores historical context representations across blocks, and a sub-block cache that enables efficient parallel generation within partially decoded blocks. Coupled with our parallel decoding pipeline, Fast-dLLM v2 achieves up to 2.5× speedup over standard AR decoding without compromising generation quality. Extensive experiments across diverse benchmarks demonstrate that Fast-dLLM v2 matches or surpasses AR baselines in accuracy, while delivering state-of-the-art efficiency among dLLMs—marking a significant step toward the practical deployment of fast and accurate LLMs.

Poster

P4-#5018

Selection, Reflection and Self-Refinement: Revisit Reasoning Tasks via a Causal Lens

Yunlong Deng ⋅ Boyang Sun ⋅ Yan Li ⋅ Zeyu Tang ⋅ Lingjing Kong ⋅ Kun Zhang ⋅ Guangyi Chen

Due to their inherent complexity, reasoning tasks have long been regarded as rigorous benchmarks for assessing the capabilities of machine learning models, especially large language models (LLMs). Although humans can solve these tasks with ease, existing models, even after extensive pre-training and post-training at scale, still fail to perform reasoning reliably. In this paper, we revisit reasoning tasks from a causal perspective, seeking to understand their behavior in latent space and to offer insights for addressing their challenges. Specifically, we cast reasoning tasks as a selection mechanism, in which high-level logical concepts function as selection operators on the given observations, such as, identifying the correct answer in a math problem or filling the appropriate entry in Sudoku. We emphasize two key properties of this formulation that shed light on the difficulty of reasoning tasks. First, the latent space exceeds the observation space in complexity, even when the correct answer is fully determined by the observed input. Second, the latent variables, corresponding to logical thought, are densely structured and exhibit strong dependencies. Building on this formulation, we introduce a framework, called SR$^2$, that incorporates the estimated latent variables as feedback into the selection mechanism, thereby facilitating the learning of dense dependencies among latent representations. The framework consists of three key modules: reflective representation learning, dependency self-refinement, and periodic intermediate alignment. Experimentally, we show that our approach yields significant gains in reasoning accuracy, for example, attaining over 10% improvement in performance with 8$\times$ fewer parameters on the Sudoku and Maze tasks over the recent advances.

Poster

P4-#5118

Reassessing Layer Pruning in LLMs: New Insights and Methods

Yao Lu ⋅ Hao Cheng ⋅ Yujie Fang ⋅ Zeyu Wang ⋅ Jiaheng Wei ⋅ Dongwei Xu ⋅ Qi Xuan ⋅ Zhaowei Zhu

Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Approximation) family, widely regarded as a leading method for pruned model fine-tuning, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final layers followed by fine-tuning the lm\_head and the remaining last three layers, yields remarkably strong performance. These pruning strategies are further supported by theoretical analyses based on the gradient flow. Following this guide, our method surpasses existing state-of-the-art pruning methods by $5.62\%$–$17.27\%$ on Llama-3.1-8B-It, by $2.36\%$–$19.45\%$ on Llama-3-8B and by $4.34\%$–$9.59\%$ on Llama-3-70B. The code is available at at https://github.com/yaolu-zjut/Navigation_LLM_layer_pruning.

Blog Track Poster

P4-#5117

Revisiting the NetHack Learning Environment

Michael Matthews ⋅ Pierluca D'Oro ⋅ Anssi Kanervisto ⋅ Scott Fujimoto ⋅ Jakob Foerster ⋅ Mikael Henaff

The NetHack Learning Environment (NLE) was proposed as a challenging benchmark to test an agents abilities to perform complex reasoning over long time horizons in a stochastic, partially-observed, procedurally generated setting. To date, no approach, including those based on reinforcement learning, using large pretrained models, using handcoded symbolic agents, imitating expert trajectories or any hybrid method has achieved significant progress towards completing the game. We take a deeper look into the mechanics and interface of the NLE and show that much of the complexity of NetHack is inaccessible due to constraints on the observation and action spaces. We propose a series of modifications and show that they meaningfully improve performance on the NLE.

Journal Track Poster

P4-#5116

Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?

Olawale Elijah Salaudeen · Nicole Chiou · Shiny Weng · Sanmi Koyejo

Spurious correlations, unstable statistical shortcuts a model can exploit, are expected to degrade performance out-of-distribution (OOD). However, across many popular OOD generalization benchmarks, vanilla empirical risk minimization (ERM) often achieves the highest OOD accuracy. Moreover, gains in in-distribution accuracy generally improve OOD accuracy, a phenomenon termed accuracy on the line, which contradicts the expected harm of spurious correlations. We show that these observations are an artifact of misspecified OOD datasets that do not include shifts in spurious correlations that harm OOD generalization, the setting they are meant to evaluate. Consequently, current practice evaluates "robustness" without truly stressing the spurious signals we seek to eliminate; our work pinpoints when that happens and how to fix it. Contributions. (i) We derive necessary and sufficient conditions for a distribution shift to reveal a model's reliance on spurious features; when these conditions hold, "accuracy on the line" disappears. (ii) We audit leading OOD datasets and find that most still display accuracy on the line, suggesting they are misspecified for evaluating robustness to spurious correlations. (iii) We catalog the few well-specified datasets and summarize generalizable design principles, such as identifying datasets of natural interventions (e.g., a pandemic), to guide future well-specified benchmarks.

Journal Track Poster

P4-#5115

Leveraging a Simulator for Learning Causal Representations from Post-Treatment Covariates for CATE

Lokesh Nagalapatti · Pranava Singhal · Avishek Ghosh · Sunita Sarawagi

Treatment effect estimation involves assessing the impact of different treatments on individual outcomes. Current methods estimate Conditional Average Treatment Effect (CATE) using observational datasets where covariates are collected before treatment assignment and outcomes are observed afterward, under assumptions like positivity and unconfoundedness. In this paper, we address a scenario where both covariates and outcomes are gathered after treatment. We show that post-treatment covariates render CATE unidentifiable, and recovering CATE requires learning treatment-independent causal representations. Prior work shows that such representations can be learned through contrastive learning if counterfactual supervision is available in observational data. However, since counterfactuals are rare, other works have explored using simulators that offer synthetic counterfactual supervision. Our goal in this paper is to systematically analyze the role of simulators in estimating CATE. We analyze the CATE error of several baselines and highlight their limitations. We then establish a generalization bound that characterizes the CATE error from jointly training on real and simulated distributions, as a function of the real-simulator mismatch. Finally, we introduce SimPONet, a novel method whose loss function is inspired from our generalization bound. We further show how SimPONet adjusts the simulator’s influence on the learning objective based on the simulator’s relevance to the CATE task. We experiment with various DGPs, by systematically varying the real-simulator distribution gap to evaluate SimPONet’s efficacy against state-of-the-art CATE baselines.

Poster

P4-#3205

Tversky Neural Networks: Psychologically Plausible Deep Learning with Differentiable Tversky Similarity

Moussa Koulako Bala Doumbouya ⋅ Dan Jurafsky ⋅ Christopher Manning

Work in psychology has highlighted that the geometric model of similarity standard in deep learning is not psychologically plausible because its metric properties such as symmetry do not align with human perception of similarity. In contrast, Tversky (1977) proposed an axiomatic theory of similarity with psychological plausibility based on a representation of objects as sets of features, and their similarity as a function of their common and distinctive features. This model of similarity has not been incorporated as a general-purpose building block in deep learning, in part because of the challenge of incorporating discrete set operations. In this paper, we develop a differentiable parameterization of Tversky's similarity that is learnable through gradient descent, and derive basic neural network building blocks such as the Tversky projection layer, which unlike the linear projection layer can model non-linear functions such as XOR. Through experiments with image recognition and language modeling neural networks, we show that the Tversky projection layer is a beneficial replacement for the linear projection layer. For instance, on the NABirds image classification task, a ResNet-50 with a Tversky projection layer trained from scratch achieves a 2.36 percentage point accuracy improvement over the linear layer baseline. With Tversky projection layers, GPT-2's perplexity on PTB decreases by 7.8%, and its parameter count by 34.8%. Finally, we propose a unified interpretation of both projection layer types as computing similarities of inputs to learned prototypes, along with a novel visualization technique. Crucially, Tversky's set-based representation enables the algebraic specification of semantic fields, which we illustrate with lexical and visual stimuli. Our work offers a new paradigm for neural networks that are not only more accurate and efficient, but also interpretable under an established theory of psychological similarity.