Fast Gradient Projection Method for Text Adversary Generation and Adversarial Training
Adversarial training has shown effectiveness and efficiency in improving the robustness of deep neural networks for image classification. For text classification, however, the discrete property of the text input space makes it hard to adapt the gradient-based adversarial methods from the image domain. Existing text attack methods, moreover, are effective but not efficient enough to be incorporated into practical text adversarial training. In this work, we propose a Fast Gradient Projection Method (FGPM) to generate text adversarial examples based on synonym substitution, where each substitution is scored by the product of gradient magnitude and the projected distance between the original word and the candidate word in the gradient direction. Empirical evaluations demonstrate that FGPM achieves similar attack performance and transferability when compared with competitive attack baselines, at the same time it is about 20 times faster than the current fastest text attack method. Such performance enables us to incorporate FGPM with adversarial training as an effective defense method, and scale to large neural networks and datasets. Experiments show that the adversarial training with FGPM (ATF) significantly improves the model robustness, and blocks the transferability of adversarial examples without any decay on the model generalization.
READ FULL TEXT