

Poster

Interpreting Robustness Proofs of Deep Neural Networks

Debangshu Banerjee · Avaljot Singh · Gagandeep Singh

Halle B #200

Abstract:

In recent years, numerous methods have been developed to formally verify the robustness of deep neural networks (DNNs). Although these techniques are effective in providing mathematical guarantees about the DNNs' behavior, it is not clear whether the proofs they generate are human-understandable. In this paper, we bridge this gap by developing new concepts, algorithms, and representations to generate human-understandable insights into the internal workings of DNN robustness proofs. Leveraging the proposed method, we show that the robustness proofs of standard DNNs rely more on spurious input features than the proofs of DNNs trained to be robust. Robustness proofs of provably robust DNNs filter out a larger number of spurious input features than those of adversarially trained DNNs, sometimes even pruning semantically meaningful input features. The proofs for DNNs that combine adversarial and provably robust training tend to strike a middle ground.
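For readers unfamiliar with what such verifiers certify, the following is a minimal sketch of one common approach, interval bound propagation, which is only an illustrative stand-in and not the method this paper proposes or analyzes. The function names (`interval_affine`, `certify_linf`) and the toy two-layer ReLU network are assumptions made for this example.

```python
import numpy as np

def interval_affine(lo, hi, W, b):
    # Propagate the box [lo, hi] through x -> W @ x + b.
    # Positive weights pick up lower bounds for the lower output
    # bound; negative weights pick up upper bounds, and vice versa.
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    return Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b

def certify_linf(x, eps, layers, label):
    # Sound (but incomplete) check that every input in the
    # L-infinity ball of radius eps around x is classified `label`.
    lo, hi = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, W, b)
        if i < len(layers) - 1:
            # ReLU is monotone, so it maps boxes to boxes.
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    # Robust if the worst-case logit of `label` still beats the
    # best-case logit of every other class.
    others = np.delete(np.arange(len(lo)), label)
    return lo[label] > hi[others].max()

# Usage on a random toy network (4 inputs, 8 hidden units, 3 classes).
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((8, 4)), np.zeros(8)),
          (rng.standard_normal((3, 8)), np.zeros(3))]
x = rng.standard_normal(4)
logits = layers[1][0] @ np.maximum(layers[0][0] @ x + layers[0][1], 0) + layers[1][1]
print(certify_linf(x, eps=0.01, layers=layers, label=int(np.argmax(logits))))
```

A "proof" here is the chain of intermediate bounds the verifier computes; the paper's contribution concerns interpreting which input features such proof artifacts actually depend on.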
