Poster
R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang · Bo Li
Hall 3 + Hall 2B #552
Thu 24 Apr, midnight – 2:30 a.m. PDT
Abstract:
As large language models (LLMs) become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate the input/output content of LLMs and ensure compliance with safety policies. Existing guardrail models, such as OpenAI Mod and LlamaGuard, treat the various safety categories (e.g., self-harm, self-harm/instructions) independently and fail to explicitly capture the intercorrelations among them. This leads to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories. To address these limitations, we propose R2-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, R2-Guard comprises two parts: data-driven guardrail models and reasoning components. The data-driven guardrail models provide unsafety probabilities of the moderated content for different safety categories. We then encode safety knowledge among different categories as first-order logical rules and embed them into a probabilistic graphical model (PGM) based reasoning component. The unsafety probabilities of different categories from the data-driven guardrail models are sent to the reasoning component for final inference. We employ two types of PGMs: Markov logic networks (MLNs) and probabilistic circuits (PCs), and optimize PCs to achieve a precision-efficiency balance via an improved graph structure. We also propose different methods to optimize the weights of the knowledge rules. To further perform stress tests on guardrail models, we employ a pairwise construction method to build a new safety benchmark, TwinSafety, which features principled categories and presents new challenges for moderation. We show that R2-Guard is effective even given unrepresentative categories or challenging jailbreaking prompts. We demonstrate the effectiveness of R2-Guard by comparing it with eight strong guardrail models on six standard moderation datasets, and demonstrate its robustness against four SOTA jailbreaking attacks. R2-Guard significantly surpasses the SOTA method LlamaGuard by 12.6% on standard moderation datasets and by 59.9% against jailbreaking attacks. We further show that R2-Guard can effectively adapt to safety category updates by simply editing the PGM reasoning graph.
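As a rough illustration of the reasoning step described in the abstract, the sketch below combines per-category unsafety probabilities with weighted implication rules via brute-force inference over a tiny Markov-logic-network-style model. The category names, rule weights, and probabilities are invented for the example; the paper's actual PGM construction, rule set, weight learning, and the optimized probabilistic-circuit variant are not reproduced here.

```python
# Illustrative sketch (not the authors' implementation) of knowledge-enhanced
# reasoning over guardrail outputs: unary evidence factors come from a
# data-driven guardrail model's per-category unsafety probabilities, and
# weighted logical rules couple the categories to an overall "unsafe" variable.
from itertools import product
from math import exp

# Hypothetical per-category unsafety probabilities for one moderated input.
category_probs = {
    "self_harm": 0.30,
    "self_harm_instructions": 0.85,
}

# Weighted implication rules (antecedents -> consequent), e.g.
# "self_harm_instructions implies self_harm" and
# "either category being unsafe implies the content is unsafe overall".
rules = [
    ((("self_harm_instructions",), "self_harm"), 3.0),
    ((("self_harm",), "unsafe"), 4.0),
    ((("self_harm_instructions",), "unsafe"), 4.0),
]

variables = list(category_probs) + ["unsafe"]

def world_weight(assignment):
    """Unnormalized weight of one truth assignment (dict: variable -> 0/1)."""
    w = 1.0
    # Unary evidence factors from the guardrail model's probabilities.
    for var, p in category_probs.items():
        w *= p if assignment[var] else (1.0 - p)
    # Rule factors: each satisfied rule multiplies the weight by exp(rule weight).
    for (antecedents, consequent), weight in rules:
        satisfied = (not all(assignment[a] for a in antecedents)) or assignment[consequent]
        if satisfied:
            w *= exp(weight)
    return w

def unsafe_marginal():
    """Exact marginal P(unsafe = 1) by enumerating all truth assignments."""
    z = 0.0           # partition function
    unsafe_mass = 0.0
    for values in product([0, 1], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        w = world_weight(assignment)
        z += w
        if assignment["unsafe"]:
            unsafe_mass += w
    return unsafe_mass / z

if __name__ == "__main__":
    print(f"P(unsafe) = {unsafe_marginal():.3f}")
```

Brute-force enumeration is exponential in the number of categories; this is where a tractable representation such as a probabilistic circuit, as used in the paper, would replace naive enumeration to keep inference efficient.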