Poster
R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang · Bo Li
Hall 3 + Hall 2B #552
Thu 24 Apr, midnight – 2:30 a.m. PDT
Abstract:
As large language models (LLMs) become increasingly prevalent across various applications, it is critical to establish safety guardrails to moderate the input/output content of LLMs and ensure compliance with safety policies. Existing guardrail models, such as OpenAI Mod and LlamaGuard, treat the various safety categories (e.g., self-harm, self-harm/instructions) independently and fail to explicitly capture the intercorrelations among them. This leads to limitations such as ineffectiveness due to inadequate training on long-tail data from correlated safety categories, susceptibility to jailbreaking attacks, and inflexibility regarding new safety categories. To address these limitations, we propose R2-Guard, a robust reasoning enabled LLM guardrail via knowledge-enhanced logical reasoning. Specifically, R2-Guard comprises two parts: data-driven guardrail models and reasoning components. The data-driven guardrail models provide unsafety probabilities of the moderated content for different safety categories. We then encode safety knowledge among different categories as first-order logical rules and embed them into a probabilistic graphical model (PGM) based reasoning component. The unsafety probabilities of different categories from the data-driven guardrail models are sent to the reasoning component for final inference. We employ two types of PGMs: Markov logic networks (MLNs) and probabilistic circuits (PCs), and optimize PCs to achieve a precision-efficiency balance via an improved graph structure. We also propose different methods to optimize the weights of the knowledge rules. To further perform stress tests on guardrail models, we employ a pairwise construction method to build a new safety benchmark, TwinSafety, which features principled categories and presents new challenges for moderation. We show that R2-Guard is effective even given unrepresentative categories or challenging jailbreaking prompts. We demonstrate the effectiveness of R2-Guard by comparing it with eight strong guardrail models on six standard moderation datasets, and demonstrate its robustness against four SOTA jailbreaking attacks. R2-Guard significantly surpasses the SOTA method LlamaGuard by 12.6% on standard moderation datasets and by 59.9% against jailbreaking attacks. We further show that R2-Guard can effectively adapt to safety category updates by simply editing the PGM reasoning graph.
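As a rough illustration of the reasoning step described in the abstract, the sketch below combines per-category unsafety probabilities with weighted implication rules via brute-force inference over a tiny Markov-logic-network-style model. The category names, rule weights, and probabilities are invented for the example; the paper's actual PGM construction, rule set, weight learning, and the optimized probabilistic-circuit variant are not reproduced here.

```python
# Illustrative sketch (not the authors' implementation) of knowledge-enhanced
# reasoning over guardrail outputs: unary evidence factors come from a
# data-driven guardrail model's per-category unsafety probabilities, and
# weighted logical rules couple the categories to an overall "unsafe" variable.
from itertools import product
from math import exp

# Hypothetical per-category unsafety probabilities for one moderated input.
category_probs = {
    "self_harm": 0.30,
    "self_harm_instructions": 0.85,
}

# Weighted implication rules (antecedents -> consequent), e.g.
# "self_harm_instructions implies self_harm" and
# "either category being unsafe implies the content is unsafe overall".
rules = [
    ((("self_harm_instructions",), "self_harm"), 3.0),
    ((("self_harm",), "unsafe"), 4.0),
    ((("self_harm_instructions",), "unsafe"), 4.0),
]

variables = list(category_probs) + ["unsafe"]

def world_weight(assignment):
    """Unnormalized weight of one truth assignment (dict: variable -> 0/1)."""
    w = 1.0
    # Unary evidence factors from the guardrail model's probabilities.
    for var, p in category_probs.items():
        w *= p if assignment[var] else (1.0 - p)
    # Rule factors: each satisfied rule multiplies the weight by exp(rule weight).
    for (antecedents, consequent), weight in rules:
        satisfied = (not all(assignment[a] for a in antecedents)) or assignment[consequent]
        if satisfied:
            w *= exp(weight)
    return w

def unsafe_marginal():
    """Exact marginal P(unsafe = 1) by enumerating all truth assignments."""
    z = 0.0           # partition function
    unsafe_mass = 0.0
    for values in product([0, 1], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        w = world_weight(assignment)
        z += w
        if assignment["unsafe"]:
            unsafe_mass += w
    return unsafe_mass / z

if __name__ == "__main__":
    print(f"P(unsafe) = {unsafe_marginal():.3f}")
```

Brute-force enumeration is exponential in the number of categories; this is where a tractable representation such as a probabilistic circuit, as used in the paper, would replace naive enumeration to keep inference efficient.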