

Poster in Workshop: Deep Generative Models for Highly Structured Data

Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Hyungyung Lee · Sungjin Park · Edward Choi


Abstract:

Though deep generative models have received a lot of attention, most existing works are designed for unimodal generation tasks. In this paper, we explore a new method for unconditional image-text pair generation. We propose MXQ-VAE, a vector quantization method for multimodal image-text representation. MXQ-VAE accepts a paired image and text as input and learns a joint quantized representation space, so that the image-text pair can be converted to a sequence of unified indices. We can then use autoregressive generative models to model the joint image-text representation and even perform unconditional image-text pair generation. Extensive experimental results demonstrate that our approach effectively generates semantically consistent image-text pairs and also enhances meaningful alignment between image and text.
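As a rough illustration of the pipeline the abstract describes, the sketch below encodes a paired image and text into one sequence of unified codebook indices via a shared vector-quantization codebook. The abstract does not specify MXQ-VAE's architecture or the cross quantizer itself, so every name, shape, and hyperparameter here (`JointVQSketch`, the encoders, the 64x64 input size, the codebook size of 512) is a hypothetical placeholder, not the authors' implementation.

```python
# Hypothetical sketch: joint image-text vector quantization with one shared
# codebook. All architecture details are assumptions, not the MXQ-VAE design.
import torch
import torch.nn as nn


class JointVQSketch(nn.Module):
    def __init__(self, vocab_size=1000, codebook_size=512, dim=64):
        super().__init__()
        # Image encoder: 3x64x64 image -> 8x8 grid of dim-d features.
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),
        )
        # Text encoder: one dim-d embedding per token.
        self.txt_enc = nn.Embedding(vocab_size, dim)
        # Shared codebook so both modalities map into the same index space.
        self.codebook = nn.Embedding(codebook_size, dim)

    def encode(self, image, tokens):
        img_feat = self.img_enc(image).flatten(2).transpose(1, 2)  # (B, 64, dim)
        txt_feat = self.txt_enc(tokens)                            # (B, T, dim)
        joint = torch.cat([img_feat, txt_feat], dim=1)             # (B, 64+T, dim)
        # Squared L2 distance from each joint feature to each codebook entry.
        w = self.codebook.weight                                   # (K, dim)
        dists = (joint.pow(2).sum(-1, keepdim=True)
                 - 2 * joint @ w.t()
                 + w.pow(2).sum(-1))                               # (B, L, K)
        indices = dists.argmin(-1)                # unified index sequence (B, L)
        quantized = self.codebook(indices)
        # Straight-through estimator: gradients bypass the argmin lookup.
        quantized = joint + (quantized - joint).detach()
        return indices, quantized


model = JointVQSketch()
image = torch.randn(2, 3, 64, 64)
tokens = torch.randint(0, 1000, (2, 16))
indices, quantized = model.encode(image, tokens)
print(indices.shape)  # torch.Size([2, 80]): 64 image + 16 text positions
```

The returned index sequence is the kind of representation an autoregressive prior (e.g. a decoder-only Transformer) would be trained on; sampling index sequences from that prior and decoding them back into pixels and tokens would yield unconditional image-text pairs.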
