Skip to yearly menu bar Skip to main content


Deep Generative Clustering with Multimodal Diffusion Variational Autoencoders

Emanuele Palumbo · Laura Manduchi · Sonia Laguna · Daphné Chopard · Julia E Vogt

Halle B #79
[ ]
Tue 7 May 1:45 a.m. PDT — 3:45 a.m. PDT


Multimodal VAEs have recently gained significant attention as generative models for weakly-supervised learning with multiple heterogeneous modalities. In parallel, VAE-based methods have been explored as probabilistic approaches for clustering tasks. At the intersection of these two research directions, we propose a novel multimodal VAE model in which the latent space is extended to learn data clusters, leveraging shared information across modalities. Our experiments show that our proposed model improves generative performance over existing multimodal VAEs, particularly for unconditional generation. Furthermore, we propose a post-hoc procedure to automatically select the number of true clusters thus mitigating critical limitations of previous clustering frameworks. Notably, our method favorably compares to alternative clustering approaches, in weakly-supervised settings. Finally, we integrate recent advancements in diffusion models into the proposed method to improve generative quality for real-world images.

Chat is not available.