

Poster

Aligning Generative Denoising with Discriminative Objectives Unleashes Diffusion for Visual Perception

Ziqi Pang · Xin Xu · Yu-Xiong Wang

Hall 3 + Hall 2B #622
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Following their success in image generation, generative diffusion models are increasingly adopted for discriminative scenarios, since generating pixels offers a unified and natural interface for perception. Although directly re-purposing the generative denoising process has yielded promising progress in both specialist (e.g., depth estimation) and generalist models, the inherent gaps between a generative process and discriminative objectives are rarely investigated. For instance, generative models can tolerate deviations at intermediate sampling steps as long as the final distribution is reasonable, while discriminative tasks, evaluated against rigorous ground truth, are sensitive to such errors. Without mitigating these gaps, diffusion for perception still struggles on tasks such as multi-modal understanding (e.g., referring image segmentation). Motivated by these challenges, we analyze and improve the alignment between the generative diffusion process and perception objectives, centered on a key observation: how perception quality evolves over the denoising process. (1) Notably, earlier denoising steps contribute more than later ones, necessitating a tailored training objective: the loss function should reflect the varied contribution of each timestep to each perception task. (2) Perception quality drops unexpectedly at later denoising steps, revealing the sensitivity of perception to the distribution shift between training and denoising. We introduce diffusion-tailored data augmentation to simulate this shift in the training data. (3) We offer a novel answer to the long-standing question of why a generative process should be useful for discriminative tasks: interactivity. The denoising process serves as a controllable user interface that adapts to users' correctional prompts and conducts multi-round interaction in an agentic workflow. Collectively, our insights enhance multiple generative diffusion-based perception models without architectural changes: a state-of-the-art diffusion-based depth estimator, previously underperforming referring image segmentation models, and perception generalists. Our code is available at https://github.com/ziqipang/ADDP.
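To make points (1) and (2) concrete, below is a minimal PyTorch-style sketch, not the authors' released code, of what a timestep-weighted denoising loss and a shift-simulating augmentation could look like. The weighting function `timestep_weight`, the linear stand-in schedule, and `simulate_denoising_shift` are illustrative assumptions rather than the paper's exact formulation; see the repository above for the official implementation.

```python
# Minimal sketch (assumptions, not the paper's exact method): a DDPM-style
# training step where (1) the per-sample loss is weighted by timestep so that
# earlier (noisier) denoising steps contribute more, and (2) the intermediate
# state is perturbed to simulate training-denoising distribution shift.
import torch
import torch.nn.functional as F

def timestep_weight(t, num_steps, gamma=2.0):
    """Hypothetical weighting: emphasize high-noise timesteps (large t),
    which are observed to contribute most to perception quality."""
    return ((t.float() + 1) / num_steps) ** gamma

def simulate_denoising_shift(x_t, scale=0.05):
    """Hypothetical diffusion-tailored augmentation: perturb the intermediate
    state to mimic the drift between training inputs and the states actually
    reached while sampling."""
    return x_t + scale * torch.randn_like(x_t)

def weighted_denoising_loss(model, x0, num_steps=1000, augment=True):
    """x0: clean target map (e.g., a depth map) encoded as a 4D image tensor.
    model(x_t, t) is assumed to predict the added noise."""
    b = x0.shape[0]
    t = torch.randint(0, num_steps, (b,), device=x0.device)
    noise = torch.randn_like(x0)
    # A real schedule's cumulative alphas would be used here; a simple
    # linear ramp stands in to keep the sketch self-contained.
    alphas_cumprod = torch.linspace(0.9999, 0.0001, num_steps, device=x0.device)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    if augment:
        x_t = simulate_denoising_shift(x_t)
    pred = model(x_t, t)
    per_sample = F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))
    return (timestep_weight(t, num_steps) * per_sample).mean()
```

In practice the noise schedule, the weighting exponent, and the shift magnitude would be tuned per perception task, and the augmentation would ideally mirror the sampler actually used at inference.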
