SmartDJ: Declarative Audio Editing with Audio Language Model
Abstract
Audio editing plays a crucial role in VR/AR immersion, virtual conferencing, sound design, and interactive media. However, recent generative audio editing models depend on template-like instruction formats and are restricted to mono-channel audio. Moreover, existing systems require users to specify low-level editing actions rather than expressing the desired outcome at a higher semantic level. We introduce SmartDJ, a novel framework for stereo audio editing that enables declarative audio editing, where users describe the desired outcome while delegating the underlying editing operations to the system. Given a high-level instruction, SmartDJ decomposes it into a sequence of atomic edit operations, such as adding, removing, or spatially relocating sound events. These operations are then executed by a diffusion model trained to edit stereo audio. To enable this capability, we design a scalable data synthesis pipeline that produces paired examples of declarative instructions, atomic edit operations, and audio clips before and after each edit operation. Experiments demonstrate that SmartDJ achieves superior perceptual quality, spatial realism, and semantic alignment compared to prior audio editing methods.