

Poster

BBCaL: Black-box Backdoor Detection under the Causality Lens

Zihan Guan · Junfeng Guo · Mengxuan Hu · Sheng Li · Zhongliang Zhou · Jielu Zhang

Hall 3 + Hall 2B #537
Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks, where attackers can inject hidden backdoors during the training stage. This poses a serious threat to the Model-as-a-Service setting, where downstream users directly utilize third-party models (e.g., HuggingFace Hub, ChatGPT). To this end, we study the inference-stage black-box backdoor detection problem in this paper, where defenders aim to build a firewall that filters out backdoor inputs at inference time, with only input samples and prediction labels available. Existing investigations of this problem either rely on strong assumptions about the types of triggers and attacks or suffer from poor efficiency. To build a more general and efficient method, we first provide a novel causality-based lens for analyzing the heterogeneous prediction behaviors of clean and backdoored samples at inference time, considering both sample-specific and sample-agnostic backdoor attacks. Motivated by this causal analysis and do-calculus in causal inference, we introduce Black-box Backdoor detection under the Causality Lens (BBCaL), which distinguishes backdoor samples from clean ones by analyzing prediction consistency after progressively constructing counterfactual samples. Theoretical analysis also sheds light on the effectiveness of BBCaL. Extensive experiments on three benchmark datasets validate the effectiveness and efficiency of our method.
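The detection idea described above, flagging inputs whose prediction stays consistent under progressively stronger counterfactual interventions, can be pictured with a short sketch. This is a minimal illustration and not the paper's algorithm: it assumes the black box is an image classifier exposed only as a predict(x) -> label call, realizes the counterfactual construction as blending with clean reference samples, and uses an illustrative decision threshold. The helpers counterfactual, prediction_consistency, and is_backdoor_input are hypothetical names introduced here for exposition.

import numpy as np

def counterfactual(x, reference, strength):
    # Replace a fraction of the input's features by blending with a clean
    # reference sample; one simple way to emulate a do()-style intervention
    # on the benign content of the input (assumption, not the paper's design).
    return (1.0 - strength) * x + strength * reference

def prediction_consistency(predict, x, references, strengths):
    # Fraction of progressively stronger counterfactuals whose predicted
    # label still matches the original black-box prediction for x.
    original = predict(x)
    matches, total = 0, 0
    for ref in references:
        for s in strengths:
            matches += int(predict(counterfactual(x, ref, s)) == original)
            total += 1
    return matches / total

def is_backdoor_input(predict, x, references, strengths, threshold=0.8):
    # Heuristic: a trigger tends to dominate the prediction even after most
    # benign features are replaced, so unusually stable labels are suspicious.
    return prediction_consistency(predict, x, references, strengths) >= threshold

In such a sketch, references would be a small held-out set of clean samples, strengths something like np.linspace(0.2, 0.8, 4) to make the interventions progressively stronger, and the threshold would be calibrated on clean data to meet a target false-positive rate; all of these choices are assumptions for illustration only.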
