Poster
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Jiaxin Wen · Vivek Hebbar · Caleb Larson · Aryan Bhatt · Ansh Radhakrishnan · Mrinank Sharma · Henry Sleight · Shi Feng · He He · Ethan Perez · Buck Shlegeris · Akbir Khan
Hall 3 + Hall 2B #299
As large language models (LLMs) grow more powerful, they also become more difficult to trust. They may be aligned with human intentions, or they may exhibit "subversive misalignment" -- introducing subtle errors that bypass safety checks. Although individual errors may not immediately cause harm, each increases the risk of an eventual safety failure. Given this uncertainty, model deployment must grapple with the tradeoff between ensuring safety and harnessing the capabilities of untrusted models. In this work, we introduce the "Diffuse Risk Management" problem, which aims to balance average-case safety and usefulness when deploying untrusted models over a large sequence of tasks. We approach this problem with a two-level framework: the single-task level (micro-protocol) and the whole-scenario level (macro-protocol). At the single-task level, we develop various micro-protocols that use a less capable but extensively tested (trusted) model to harness and monitor the untrusted model. At the whole-scenario level, we find an optimal macro-protocol that uses an adaptive estimate of the untrusted model's risk to choose between micro-protocols. To evaluate the robustness of our method, we follow control evaluations in a code generation testbed, in which a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. Experimental results show that our approach retains 99.6% of the untrusted model's usefulness while ensuring near-perfect safety, significantly outperforming existing deployment methods. Our approach also remains robust when the trusted and untrusted models have a large capability gap. These findings demonstrate the promise of managing diffuse risks in the deployment of increasingly capable but untrusted LLMs.
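The following is a minimal sketch of how an adaptive macro-protocol of this kind might work, assuming a Bayesian posterior over whether the untrusted model is misaligned that is updated from a trusted monitor's suspicion scores. Every name, the prior, the likelihood model, and the risk threshold below are illustrative assumptions, not the paper's actual protocol or parameters.

```python
import random

# Hypothetical sketch: a macro-protocol maintains a posterior belief that the
# untrusted model is misaligned, updates it after each task from a trusted
# monitor's suspicion score, and picks a micro-protocol accordingly.
# All constants and names here are illustrative assumptions.

P_MISALIGNED_PRIOR = 0.5  # assumed prior that the untrusted model is subversive
RISK_THRESHOLD = 0.2      # above this posterior risk, fall back to the safe protocol

def monitor_likelihood(score: float, misaligned: bool) -> float:
    """Assumed likelihood of a suspicion score in [0, 1] under each hypothesis."""
    # Assumption: a misaligned model tends to draw higher suspicion scores.
    return score if misaligned else 1.0 - score

def update_posterior(posterior: float, score: float) -> float:
    """One Bayesian update of P(misaligned) from a single monitor score."""
    p_bad = monitor_likelihood(score, misaligned=True) * posterior
    p_good = monitor_likelihood(score, misaligned=False) * (1.0 - posterior)
    return p_bad / (p_bad + p_good)

def choose_micro_protocol(posterior: float) -> str:
    """Adaptive choice: harness the untrusted model while estimated risk is low."""
    if posterior < RISK_THRESHOLD:
        return "untrusted_generates_trusted_monitors"  # favors usefulness
    return "trusted_generates"                         # favors safety

posterior = P_MISALIGNED_PRIOR
for task_id in range(1000):
    protocol = choose_micro_protocol(posterior)
    # ... run the chosen micro-protocol on the task ...
    score = random.random()  # stand-in for the trusted monitor's output
    posterior = update_posterior(posterior, score)
```

The key design choice this illustrates is that safety decisions are made at the scenario level rather than per task: evidence accumulated across many tasks drives a gradual shift toward safer micro-protocols as estimated risk rises.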