ICLR 2023 Inequality phenomenon in $l_{\infty}$-adversarial training, and its unrealized threats Oral

In-Person Oral presentation / top 25% paper

Ranjie Duan · YueFeng Chen · Yao Zhu · Xiaojun Jia · Rong Zhang · Hui Xue'

Session: Oral 6 Track 4: Applications & Social Aspects of Machine Learning & General Machine Learning

Abstract: The appearance of adversarial examples has drawn attention from both academia and industry. In the ongoing attack-defense arms race, adversarial training is the most effective defense against adversarial examples. However, we find that an inequality phenomenon occurs during $l_{\infty}$-adversarial training: a few features dominate the prediction made by the adversarially trained model. We systematically evaluate this phenomenon through extensive experiments and find that it becomes more pronounced as the adversarial strength used in training (measured by $\epsilon$) increases. We hypothesize that this inequality makes an $l_{\infty}$-adversarially trained model less reliable than a standard-trained model when the few "important features" are influenced. To validate our hypothesis, we propose two simple attacks that either perturb or replace the important features using noise or occlusion. Experiments show that $l_{\infty}$-adversarially trained models can be easily attacked when the few important features are influenced. Our work sheds light on the limits of the practicality of $l_{\infty}$-adversarial training.
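
As a rough illustration of the two ideas in the abstract, the sketch below measures how unequally importance is spread over input features and then occludes the most important ones. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not name an inequality metric, so the Gini coefficient is used here as one common choice; the attribution map is assumed to be given by some feature-attribution method; and the helper names `gini` and `occlude_top_k` are hypothetical.

```python
import numpy as np


def gini(attribution):
    """Gini coefficient of absolute attribution values.

    Assumed metric (not stated in the abstract): 0 means importance is spread
    evenly, values near 1 mean a few features carry almost all of it.
    """
    a = np.sort(np.abs(np.asarray(attribution, dtype=float)).ravel())
    n = a.size
    total = a.sum()
    if total == 0:
        return 0.0  # no importance anywhere: treat as perfectly equal
    ranks = np.arange(1, n + 1)
    # Standard closed form for sorted values: sum((2i - n - 1) * x_i) / (n * sum(x))
    return float((2 * ranks - n - 1) @ a / (n * total))


def occlude_top_k(image, attribution, k, fill=0.0):
    """Return a copy of `image` with the k most important positions set to `fill`.

    Assumes `image` and `attribution` have the same shape (hypothetical setup).
    """
    top_idx = np.argsort(np.abs(np.asarray(attribution)).ravel())[-k:]
    out = np.array(image, dtype=float).ravel()  # fresh copy; input is untouched
    out[top_idx] = fill
    return out.reshape(np.shape(image))


# Usage: an attribution map where one feature dominates.
attr = np.array([[0.0, 5.0, 0.0], [1.0, 0.0, 0.0]])
img = np.arange(6, dtype=float).reshape(2, 3)
score = gini(attr)
occluded = occlude_top_k(img, attr, k=1)
```

Under this sketch, a higher `gini` score for a model's attribution maps would correspond to the stronger inequality the abstract describes, and feeding `occluded` back to the model would be a simple stand-in for the occlusion-style attack.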
