Poster
Storybooth: Training-Free Multi-Subject Consistency for Improved Visual Storytelling
Jaskirat Singh · Junshen K Chen · Jonas Kohler · Michael Cohen
Hall 3 + Hall 2B #93
Sat 26 Apr, midnight PDT to 2:30 a.m. PDT
Abstract:
Consistent text-to-image generation depicting the *same* subjects across different images has gained significant recent attention due to its widespread applications in visual storytelling and multi-shot video generation. While remarkable, existing methods often require costly finetuning for each subject and struggle to maintain consistency across multiple characters. In this work, we first analyse the reason for these limitations. Our exploration reveals that the primary issue stems from *self-attention leakage*, which is exacerbated when trying to ensure consistency across multiple characters. Motivated by these findings, we next propose a simple yet effective *training- and optimization-free approach* for improving multi-character consistency. In particular, we first leverage multi-modal *chain-of-thought* reasoning to localize the different subjects across the storyboard frames *a priori*. The final storyboard images are then generated using a modified diffusion model which includes *1) a bounded cross-attention layer* for ensuring adherence to the initially predicted layout, and *2) a bounded cross-frame self-attention layer* for reducing inter-character attention leakage. Furthermore, we also propose a novel *cross-frame token-merging layer* which allows for improved fine-grained consistency of the storyboard characters. Experimental analysis reveals that the proposed approach is not only 30× faster than prior training-based methods (*e.g., textual inversion, DreamBooth-LoRA*) but also surpasses the prior *state-of-the-art*, exhibiting improved multi-character consistency and text-to-image alignment performance.
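To make the idea of a *bounded cross-frame self-attention layer* concrete, below is a minimal sketch of how cross-frame attention can be masked so that tokens inside one subject's predicted box cannot attend to other subjects' regions in another frame. This is not the paper's implementation; the function names (`bounded_cross_frame_attention`, `region_mask`), the box format, and the choice to leave background tokens unconstrained are assumptions made purely for illustration.

```python
# Minimal sketch (assumed, not the authors' code) of masking cross-frame
# self-attention by subject regions to limit inter-character attention leakage.
import torch
import torch.nn.functional as F


def region_mask(boxes, h, w):
    """Boolean masks of shape (num_subjects, h*w): True inside each subject box.

    `boxes` holds one (x0, y0, x1, y1) box per subject, in normalized [0, 1]
    coordinates on the latent grid (a hypothetical layout format).
    """
    ys = torch.linspace(0, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(0, 1, w).view(1, w).expand(h, w)
    masks = []
    for (x0, y0, x1, y1) in boxes:
        inside = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
        masks.append(inside.reshape(-1))
    return torch.stack(masks)  # (num_subjects, h*w)


def bounded_cross_frame_attention(q, k, v, boxes_q, boxes_k, h, w, scale=None):
    """Cross-frame attention restricted so each subject's query tokens ignore
    keys that fall inside a *different* subject's box in the other frame.

    q: (tokens_q, d) queries from the current frame.
    k, v: (tokens_k, d) keys/values gathered from another storyboard frame.
    boxes_q / boxes_k: per-subject boxes in the query / key frame (same order).
    """
    scale = scale or q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-1, -2)) * scale  # (tokens_q, tokens_k)

    mq = region_mask(boxes_q, h, w)  # (S, tokens_q)
    mk = region_mask(boxes_k, h, w)  # (S, tokens_k)

    # Start fully permissive; background queries remain unconstrained here.
    allowed = torch.ones_like(logits, dtype=torch.bool)
    for s in range(mq.shape[0]):
        # Keys belonging to any *other* subject in the key frame.
        other = mk[torch.arange(mk.shape[0]) != s].any(dim=0)
        # Queries of subject s may not attend onto other characters' regions.
        allowed[mq[s]] &= ~other
    logits = logits.masked_fill(~allowed, float("-inf"))
    return F.softmax(logits, dim=-1) @ v


# Toy usage on a 16x16 latent grid with two characters per frame.
if __name__ == "__main__":
    h = w = 16
    d = 64
    q, k, v = (torch.randn(h * w, d) for _ in range(3))
    boxes = [(0.05, 0.2, 0.45, 0.9), (0.55, 0.2, 0.95, 0.9)]  # two characters
    out = bounded_cross_frame_attention(q, k, v, boxes, boxes, h, w)
    print(out.shape)  # torch.Size([256, 64])
```

A bounded cross-attention layer over the text prompt could be masked analogously, restricting each subject's text tokens to that subject's predicted box so the generation adheres to the initial layout.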