ICLR Poster Illusory Attacks: Information-theoretic detectability matters in adversarial attacks

Spotlight Poster

Illusory Attacks: Information-theoretic detectability matters in adversarial attacks

Tim Franzmeyer · Stephen McAleer · Joao F. Henriques · Jakob Foerster · Philip Torr · Adel Bibi · Christian Schroeder de Witt

Halle B #175


Abstract: Autonomous agents deployed in the real world need to be robust against adversarial attacks on sensory inputs. Robustifying agent policies requires anticipating the strongest attacks possible. We demonstrate that existing observation-space attacks on reinforcement learning agents have a common weakness: while effective, their lack of information-theoretic detectability constraints makes them detectable using automated means or human inspection. Detectability is undesirable to adversaries as it may trigger security escalations. We introduce illusory attacks, a novel form of adversarial attack on sequential decision-makers that is both effective and of $\epsilon$-bounded statistical detectability. We propose a novel dual ascent algorithm to learn such attacks end-to-end. Compared to existing attacks, we empirically find illusory attacks to be significantly harder to detect with automated methods, and a small study with human participants (IRB approval under reference R84123/RE001) suggests they are similarly harder to detect for humans. Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses. The project website can be found at https://tinyurl.com/illusory-attacks.
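To give a feel for what a dual ascent scheme under a detectability budget can look like, here is a minimal toy sketch. It is not the authors' algorithm: it assumes a single-step setting with three discrete observations, uses KL divergence from the clean observation distribution as a stand-in detectability measure, and all names (`kl`, `inner_min`, `dual_ascent`, `p`, `r`, `eps`) are hypothetical. The adversary chooses a perturbed distribution `q` that lowers the victim's expected reward while keeping the divergence from `p` at most `eps`; the multiplier `lam` is raised whenever the constraint is violated.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def inner_min(p, r, lam):
    """Minimise r.q + lam * KL(q || p) over the simplex.

    The minimiser has the closed form q_i proportional to p_i * exp(-r_i / lam).
    Subtracting min(r) keeps every exponent non-positive for numerical safety.
    """
    m = min(r)
    w = [pi * math.exp(-(ri - m) / lam) for pi, ri in zip(p, r)]
    z = sum(w)
    return [wi / z for wi in w]

def dual_ascent(p, r, eps, lam=1.0, step=0.5, iters=2000):
    """Toy dual ascent: gradient ascent on the multiplier lam.

    The gradient of the dual function with respect to lam is the
    constraint violation KL(q_lam || p) - eps.
    """
    for _ in range(iters):
        q = inner_min(p, r, lam)
        lam = max(1e-6, lam + step * (kl(q, p) - eps))
    return inner_min(p, r, lam), lam

# Illustrative numbers only (assumed, not from the paper).
p = [0.5, 0.3, 0.2]    # clean observation distribution
r = [1.0, 0.5, 0.0]    # victim reward associated with each observation
eps = 0.05             # detectability budget

q, lam = dual_ascent(p, r, eps)
print("attacked distribution:", q)
print("multiplier:", lam)
print("KL(q || p):", kl(q, p))
```

In this toy problem the inner minimisation has a closed form, so only the multiplier is updated iteratively; in a learned, sequential setting both the attack and the multiplier would typically be updated by gradient steps.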
