

Poster

CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation

Gaojie Lin · Jianwen Jiang · Chao Liang · Tianyun Zhong · Jiaqi Yang · Zerong Zheng · Yanbo Zheng

Fri 25 Apr 7 p.m. PDT — 9:30 p.m. PDT
 
Oral presentation: Oral Session 6F
Sat 26 Apr 12:30 a.m. PDT — 2 a.m. PDT

Abstract:

Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. While breakthroughs have been made in driving portrait animation through various modalities, most current solutions for human body animation still rely on video-driven methods, leaving audio-driven talking body generation relatively underexplored. In this paper, we introduce CyberHost, a one-stage audio-driven talking body generation framework that addresses common synthesis degradations in half-body animation, including hand integrity, identity consistency, and natural motion. CyberHost's key designs are twofold. First, the Region Attention Module (RAM) maintains a set of learnable, implicit, identity-agnostic latent features and combines them with identity-specific local visual features to enhance the synthesis of critical local regions. Second, the Human-Prior-Guided Conditions introduce human structural priors into the model, reducing uncertainty in the generated motion patterns and thereby improving the stability of the generated videos. To our knowledge, CyberHost is the first one-stage audio-driven human diffusion model capable of zero-shot video generation for the human body. Extensive experiments demonstrate that CyberHost surpasses previous works both quantitatively and qualitatively. CyberHost can also be extended to video-driven and audio-video hybrid-driven scenarios, achieving similarly satisfactory results.
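To make the Region Attention Module idea concrete, below is a minimal PyTorch sketch based only on the abstract's description: a bank of learnable, identity-agnostic latent tokens is fused with identity-specific local visual features via cross-attention. All class and parameter names, shapes, and the exact fusion layout are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class RegionAttentionModule(nn.Module):
    """Hypothetical sketch of a region attention block (not the paper's code).

    A set of learnable, identity-agnostic latent tokens attends to
    identity-specific local visual features (e.g. tokens from a hand or
    face crop), producing fused region tokens that could then condition
    a diffusion backbone.
    """

    def __init__(self, dim=512, num_latents=16, num_heads=8):
        super().__init__()
        # Learnable, implicit, identity-agnostic latents shared across identities.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        # Cross-attention: latents are queries, local visual features are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, local_feats):
        # local_feats: (B, N, dim) identity-specific tokens from a region crop.
        b = local_feats.shape[0]
        q = self.norm_q(self.latents).unsqueeze(0).expand(b, -1, -1)
        kv = self.norm_kv(local_feats)
        fused, _ = self.cross_attn(q, kv, kv)  # (B, num_latents, dim)
        fused = fused + self.ff(fused)
        # The fused region tokens would then be injected into the diffusion
        # backbone, e.g. via additional cross-attention layers (not shown).
        return fused

# Usage sketch: 64 visual tokens from a hand crop, batch of 2.
ram = RegionAttentionModule()
region_tokens = ram(torch.randn(2, 64, 512))  # -> (2, 16, 512)
```

The design choice sketched here, shared learnable latents queried against per-identity local features, matches the abstract's claim of combining identity-agnostic and identity-specific information; how the resulting tokens are consumed by the denoising network is not specified in the abstract.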
