Slot attention has shown remarkable object-centric representation learning performance in computer vision tasks without requiring any supervision. Despite the object-centric binding ability that its compositional modelling provides, slot attention is a deterministic module and therefore cannot generate novel scenes. In this paper, we propose Slot-VAE, a generative model that integrates slot attention with the hierarchical VAE framework for object-centric structured image generation. For each image, the model simultaneously infers a global scene representation that captures high-level scene structure and object-centric slot representations that embed individual object components. During generation, the slot representations are generated from the global scene representation to ensure coherent scene structure. Our experiments demonstrate that Slot-VAE achieves better scene structure accuracy and sample quality than slot-based baselines.
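
The two-level generative process described above can be sketched as a factorization. This is an illustrative reading only and not necessarily the exact formulation used by Slot-VAE: the symbols $z^{g}$ (global scene latent), $z^{s}_{1:K}$ (the $K$ slot latents), $\theta$ and $\phi$ (generative and inference parameters), and the choice to condition the decoder on the slots alone are all assumptions introduced here.

```latex
% Assumed generative model: global latent first, then slots, then the image.
\begin{equation}
  p_\theta\!\left(x, z^{s}_{1:K}, z^{g}\right)
    = p\!\left(z^{g}\right)\,
      p_\theta\!\left(z^{s}_{1:K} \mid z^{g}\right)\,
      p_\theta\!\left(x \mid z^{s}_{1:K}\right)
\end{equation}

% Standard evidence lower bound for this assumed model, with a joint
% approximate posterior q_\phi over both levels of latents.
\begin{equation}
  \log p_\theta(x) \;\ge\;
    \mathbb{E}_{q_\phi\left(z^{g},\, z^{s}_{1:K} \mid x\right)}
      \!\left[\log p_\theta\!\left(x \mid z^{s}_{1:K}\right)\right]
    - \mathrm{KL}\!\left(
        q_\phi\!\left(z^{g}, z^{s}_{1:K} \mid x\right)
        \,\middle\|\,
        p\!\left(z^{g}\right) p_\theta\!\left(z^{s}_{1:K} \mid z^{g}\right)
      \right)
\end{equation}
```

Under this reading, a novel scene would be sampled ancestrally: draw $z^{g}$ from its prior, draw the slots $z^{s}_{1:K}$ conditioned on $z^{g}$, then decode the slots into an image.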