Benchmarking AI Control Protocols for Safety in Medical Question-Answering Tasks
Abstract
Large Language Models have the potential to expand access to relevant clinical information by enabling new ways of interacting with medical knowledge in natural language. However, their deployment in medical question-answering settings is high-stakes, since misaligned outputs can cause severe patient harm. AI Control is an emerging approach that uses external safeguards to mitigate unsafe behaviours in potentially misaligned systems and has proven effective in domains such as code generation. However, its applicability and effectiveness in medical settings have not been systematically studied. In this work, we present a pipeline for evaluating AI Control protocols in the medical domain. To this end, we introduce HealthBench-DDI, an extension of HealthBench-Consensus and the first AI Control evaluation benchmark for biomedical question-answering, which enriches 3,512 multi-turn medical conversations with drug-drug interaction information derived from official FDA labels. Furthermore, motivated by the medical domain, we argue for reformulating safety to account for the severity of unsafe outputs rather than solely their probability. Under this revised definition, we show that existing control protocols can be subverted, and we propose severity-based monitoring to address this limitation.