Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis
Abstract
Deep learning-based respiratory auscultation is currently hindered by two fundamental disconnects: the representation gap, where compressing signals into spectrograms discards transient acoustic events and clinical context, and the data gap, characterized by severe class imbalance and data scarcity. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A²CA). Unlike static pipelines, the Thinker-A²CA acts as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. Under this unified orchestration, we propose two specialized architectural solutions. First, to address the representation gap, we introduce a Modality Weaving Diagnoser. This module moves beyond standard fusion by explicitly interleaving electronic health record (EHR) tokens with audio tokens, and it employs Strategic Global Attention to capture long-range clinical dependencies while retaining sensitivity to millisecond-level transient events via sparse audio anchors. Second, to resolve the data gap, we design a Flow Matching Generator that retools a text-only large language model (LLM) via modality injection. Guided by the Thinker-A²CA, this generator decouples pathological content from acoustic style to programmatically synthesize high-fidelity, hard-to-diagnose samples that remedy the system’s boundary errors. To support this work, we construct Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that our agentic co-design consistently outperforms prior approaches, advancing robust and deployable respiratory intelligence. Data and code will be released upon acceptance.
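As background for the generator, we note the standard conditional flow matching objective under linear interpolation paths (Lipman et al., 2023); this is a generic formulation rather than the exact path and conditioning design used in Resp-Agent, and the conditioning variable $c$ (here standing for the disentangled pathological content and acoustic style codes) is illustrative. Given a data sample $x_1 \sim p_{\mathrm{data}}$ and Gaussian noise $x_0 \sim \mathcal{N}(0, I)$, the intermediate state is $x_t = (1-t)\,x_0 + t\,x_1$, and a velocity network $v_\theta$ is trained to regress the constant target velocity $x_1 - x_0$:
\[
\mathcal{L}_{\mathrm{FM}}(\theta) \;=\; \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0,I),\; x_1 \sim p_{\mathrm{data}}} \Big\| v_\theta\big((1-t)\,x_0 + t\,x_1,\; t,\; c\big) - (x_1 - x_0) \Big\|_2^2 .
\]
At inference time, a sample is drawn by integrating the learned ordinary differential equation $\dot{x}_t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$, so varying $c$ lets a controller such as the Thinker-A²CA steer generation toward targeted pathological content and acoustic styles.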