TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models
Abstract
The rapid development of Language Diffusion Models (LDMs) challenges the dominance of auto-regressive models in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding but may also introduce new trustworthiness challenges. To better understand the risks in their generation pipelines, we introduce TrustLDM, a comprehensive trustworthiness benchmark tailored to LDMs that evaluates safety, privacy, and fairness across different LDM architectures under multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness when given only user prompts, their alignment degrades noticeably when malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and that both decoding order and generation length affect evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages the decoding flexibility of LDMs to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. We hope our work helps the community build more trustworthy LDMs.