NurValues: Real-World Nursing Values Evaluation for Large Language Models in Clinical Context
Abstract
While LLMs have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses' professional judgments, potentially intensifying nurse–patient conflicts. Such risks highlight the urgent need of evaluating whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: Altruism, Human Dignity, Integrity, Justice, and Professionalism. We define two-level tasks on the benchmark, considering the two characteristics of emerging nurse–patient conflicts. The Easy-Level dataset consists of 2,200 value-aligned and value-violating instances, which are collected through a five-month longitudinal field study across three hospitals of varying tiers; The Hard-Level dataset is comprised of 2,200 dialogue-based instances that embed contextual cues and subtle misleading signals, which increase adversarial complexity and better reflect the subjectivity and bias of narrators in the context of emerging nurse-patient conflicts. We evaluate a total of 23 SoTA LLMs on their ability to align with nursing values, and find that general LLMs outperform medical ones, and Justice is the hardest value dimension. As the first real-world benchmark for healthcare value alignment, NurValues provides novel insights into how LLMs navigate ethical challenges in clinician–patient interactions.