I Can’t Believe It’s Not Safer: Preference–Safety Disassociation in Clinical LLM Evaluation
Fay Elhassan ⋅ David Sasu ⋅ Lars Klein ⋅ Alexandra Kulinkina ⋅ Mary-Anne Hartley
Abstract
We examine how clinicians evaluate large language models for medical use using expert feedback collected through an evaluation platform, MOOVE (Massive Open Online Validation and Evaluation). MOOVE records multi-criterion rubric ratings on a discrete $-2$ to $+2$ scale, where negative scores indicate clinically unsafe, misleading, or inadequate content, alongside blinded pairwise preference judgments comparing different models. Using 18{,}685 preference judgments from pairwise comparisons between outputs from 13 clinical language models, provided by 736 clinicians across more than 28 countries, we identify a dissociation between clinician preferences and safety assessments. Models that are frequently preferred or perform well on aggregate metrics can still exhibit substantial rates of clinically meaningful failures ($\leq -1$) in key metrics like harmfulness and accuracy. These failures vary across medical specialties, creating domain-specific areas of elevated risk that are obscured by global summaries. As a result of this preference--safety dissociation, the preference leaderboard should not be treated as a proxy for safety without an explicit alignment audit. Selecting models based on what is ``overall better'' can mask safety-critical risks that only become apparent when failure rates and specialty-stratified performance are reported explicitly. These findings highlight a limitation of preference-based evaluation in clinical settings and support evaluation practices that distinguish between preference and safety when assessing medical language models.
Chat is not available.
Successful Page Load