Poster
in
Workshop: I Can't Believe It's Not Better: Where Large Language Models need to improve

I Can’t Believe It’s Not Safer: Preference–Safety Disassociation in Clinical LLM Evaluation

Fay Elhassan ⋅ David Sasu ⋅ Lars Klein ⋅ Alexandra Kulinkina ⋅ Mary-Anne Hartley

Project Page [ Poster] [ OpenReview]

Abstract

We examine how clinicians evaluate large language models for medical use using expert feedback collected through an evaluation platform, MOOVE (Massive Open Online Validation and Evaluation). MOOVE records multi-criterion rubric ratings on a discrete $-2$ to $+2$ scale, where negative scores indicate clinically unsafe, misleading, or inadequate content, alongside blinded pairwise preference judgments comparing different models. Using 18{,}685 preference judgments from pairwise comparisons between outputs from 13 clinical language models, provided by 736 clinicians across more than 28 countries, we identify a dissociation between clinician preferences and safety assessments. Models that are frequently preferred or perform well on aggregate metrics can still exhibit substantial rates of clinically meaningful failures ($\leq -1$) in key metrics like harmfulness and accuracy. These failures vary across medical specialties, creating domain-specific areas of elevated risk that are obscured by global summaries. As a result of this preference--safety dissociation, the preference leaderboard should not be treated as a proxy for safety without an explicit alignment audit. Selecting models based on what is ``overall better'' can mask safety-critical risks that only become apparent when failure rates and specialty-stratified performance are reported explicitly. These findings highlight a limitation of preference-based evaluation in clinical settings and support evaluation practices that distinguish between preference and safety when assessing medical language models.

Chat is not available.