

Poster

PT-T2I/V: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Image/Video Task

Jing Wang · Ao Ma · Jiasong Feng · Dawei Leng · Yuhui Yin · Xiaodan Liang

Hall 3 + Hall 2B #179
Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract: The global self-attention mechanism in diffusion transformers involves redundant computation because visual information is sparse and redundant, and the attention maps of tokens within the same spatial window are highly similar. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative-token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we average the tokens in each spatial-temporal window to obtain a proxy token for that region. Global semantics are captured through self-attention over these proxy tokens and then injected into all latent tokens via cross-attention. At the same time, we introduce window attention and shifted-window attention to compensate for the weaker detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the PT-T2I/V family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing computational complexity in both image and video generation (e.g., a 59% reduction compared to DiT and a 34% reduction compared to PixArt-α). Visual results and code are available at https://360cvgroup.github.io/Qihoo-T2X/.
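The proxy-token mechanism described above can be illustrated with a short sketch. The code below is a minimal PyTorch approximation, assuming a 2D image case with average pooling over square windows; the module name ProxyTokenAttention, the window size, and all shapes are illustrative assumptions, not the authors' implementation (which also handles the temporal axis and adds window/shifted-window attention for detail).

```python
# Minimal sketch of proxy-token attention (image case): average each spatial
# window into one proxy token, run self-attention over the few proxies, then
# inject the result into all latent tokens via cross-attention.
# Hypothetical module; shapes and names are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProxyTokenAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window
        # Self-attention over the (few) proxy tokens captures global semantics.
        self.proxy_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention injects those semantics back into every latent token.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) latent tokens on an h x w grid, N = h * w.
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        # One proxy token per window = the average of the tokens inside it.
        proxies = F.avg_pool2d(grid, self.window)      # (B, C, h/win, w/win)
        proxies = proxies.flatten(2).transpose(1, 2)   # (B, M, C), with M << N
        # Global self-attention among proxies only: O(M^2) instead of O(N^2).
        proxies, _ = self.proxy_self_attn(proxies, proxies, proxies)
        # Every latent token queries the proxies to receive global context.
        out, _ = self.cross_attn(x, proxies, proxies)
        return x + out


# Usage: a 32x32 latent grid yields 1024 tokens, but only 64 proxy tokens
# participate in the quadratic global attention.
block = ProxyTokenAttention(dim=128, window=4)
tokens = torch.randn(2, 32 * 32, 128)
print(block(tokens, 32, 32).shape)  # torch.Size([2, 1024, 128])
```

With a window of 4, the quadratic cost of global attention drops by a factor of roughly (4*4)^2 = 256 relative to full token self-attention, which is the source of the efficiency gains the abstract reports.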
