LLM Agentic System Safety Requires Hybrid Alignment
Abstract
Current agentic safety research prioritizes the neural components of the agent system, for example by training models for safe question answering. However, agent architectures require coordination between neural components (the foundation model) and symbolic components (memory systems, tools, and environments), and many alignment objectives require capabilities from both. An objective like "do not facilitate weapons production" requires understanding which combinations of information are dangerous (a neural capability) and controlling what information is stored and provided across sessions (a symbolic capability). Neither component can satisfy such objectives alone: neural components lack visibility into cumulative patterns across sessions, while symbolic components cannot assess the safety implications of the information they track. We characterize this alignment gap and demonstrate how it produces unsafe system behavior even when each component functions correctly. We then propose hybrid alignment, a framework in which neural components are trained to seek and use information from symbolic components, and symbolic components are designed to expose the information that neural components need for safety reasoning. Applying this framework requires domain expertise to specify which coordination mechanisms are appropriate. Our work establishes a new direction for agent safety research that treats alignment as a property of neural-symbolic coordination rather than of the neural component alone.
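The alignment gap described above can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in rather than the paper's implementation: `neural_assess` is a toy rule playing the role of the foundation model's safety judgment, `SessionMemory` plays the role of a symbolic memory system, and `topic_a` and `topic_b` are placeholder labels for pieces of information assumed to be harmless alone but unsafe in combination.

```python
from dataclasses import dataclass, field

# Hypothetical assumption: these two placeholder topics are each harmless,
# but providing both to the same user is treated as unsafe.
RISKY_COMBINATIONS = [frozenset({"topic_a", "topic_b"})]


def neural_assess(topics):
    """Toy stand-in for the model's safety judgment over a set of topics.

    Returns True when the set contains no complete risky combination.
    """
    topics = set(topics)
    return not any(combo <= topics for combo in RISKY_COMBINATIONS)


@dataclass
class SessionMemory:
    """Toy symbolic component: tracks which topics each user has received."""
    history: dict = field(default_factory=dict)

    def record(self, user, topic):
        self.history.setdefault(user, set()).add(topic)

    def provided_topics(self, user):
        # Exposes cumulative cross-session state for safety reasoning.
        return set(self.history.get(user, set()))


def isolated_agent(memory, user, topic):
    """The model judges only the current request; memory only stores."""
    allowed = neural_assess({topic})
    if allowed:
        memory.record(user, topic)
    return allowed


def hybrid_agent(memory, user, topic):
    """The model queries memory and judges the cumulative set of topics."""
    allowed = neural_assess(memory.provided_topics(user) | {topic})
    if allowed:
        memory.record(user, topic)
    return allowed


# Two sessions from the same user, each requesting one placeholder topic.
isolated_memory = SessionMemory()
isolated_results = [isolated_agent(isolated_memory, "user1", t)
                    for t in ("topic_a", "topic_b")]

hybrid_memory = SessionMemory()
hybrid_results = [hybrid_agent(hybrid_memory, "user1", t)
                  for t in ("topic_a", "topic_b")]
```

In this sketch both components of the isolated agent behave correctly by their own lights, yet the user ends up with the full combination. The hybrid variant differs only in that the assessment step asks the memory for cumulative state before deciding, which is the kind of coordination the framework calls for; how such queries should be designed for a real domain is left open here.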