Test-Time Training Undermines Existing Safety Guardrails
Abstract
Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference. TTT has been shown to be useful in several settings, including few-shot learning, retrieval-augmented models, and complex reasoning tasks. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models. In this work, we identify several new threat models for TTT and demonstrate how attackers can leverage these settings to bypass safety filters. Our results show that TTT consistently increases the Attack Success Rate (ASR). These findings suggest that TTT exposes a new attack surface, strengthens attacks, and undermines existing safety guardrails. Thus, we argue that establishing additional safety guidelines is essential for the secure deployment of TTT in real-world applications.
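To make the adaptation mechanism concrete, the following is a minimal sketch of a single TTT inference step in PyTorch. The function name `ttt_predict`, the self-supervised objective `aux_loss_fn`, and the hyperparameters are illustrative assumptions, not the specific setup evaluated in this work; the key point is only that the weights used for the final prediction depend on the test input itself.

```python
# Minimal TTT sketch (illustrative, not this paper's exact method):
# take a few gradient steps on a self-supervised loss computed from
# the test input, then predict with the adapted weights.
import copy

import torch
import torch.nn as nn


def ttt_predict(model: nn.Module, x: torch.Tensor,
                aux_loss_fn, steps: int = 3, lr: float = 1e-4) -> torch.Tensor:
    """Adapt a copy of the model on the test input, then predict.

    aux_loss_fn(model, x) is any self-supervised objective computable
    at inference time (e.g., masked reconstruction of x); it is an
    assumed placeholder here.
    """
    adapted = copy.deepcopy(model)      # leave the deployed weights intact
    adapted.train()
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):              # a few gradient steps on the test input
        opt.zero_grad()
        loss = aux_loss_fn(adapted, x)
        loss.backward()
        opt.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(x)               # prediction from input-dependent weights
```

Because the parameters that produce the output are updated as a function of attacker-controlled input, safety behavior aligned into the original weights is not guaranteed to survive the adaptation, which is the attack surface this work examines.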