ICLR Poster HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining

Poster

HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining

Minjae Jeong · Yechan Hwang · Jaejin Lee · Sungyoon Jung · Won Hwa Kim

Hall 3 + Hall 2B #153

[ Abstract ]

Thu 24 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Text-to-motion generation has significant potential in a wide range of applications including animation, robotics, and AR/VR. While recent works on masked motion models are promising, the task remains challenging due to the inherent ambiguity in text and the complexity of human motion dynamics. To overcome the issues, we propose a novel text-to-motion generation framework that integrates two key components: Hard Token Mining (HTM) and a Hierarchical Generative Masked Motion Model (HGM³). Our HTM identifies and masks challenging regions in motion sequences and directs the model to focus on hard-to-learn components for efficacy. Concurrently, the hierarchical model uses a semantic graph to represent sentences at different granularity, allowing the model to learn contextually feasible motions. By leveraging a shared-weight masked motion model, it reconstructs the same sequence under different conditioning levels and facilitates comprehensive learning of complex motion patterns. During inference, the model progressively generates motions by incrementally building up coarse-to-fine details. Extensive experiments on benchmark datasets, including HumanML3D and KIT-ML, demonstrate that our method outperforms existing methods in both qualitative and quantitative measures for generating context-aware motions.

Live content is unavailable. Log in and register to view live content