
SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions

State-of-the-art NLP models can often be fooled by transformations that are imperceptible to humans, such as synonymous word substitution. For security reasons, it is of critical importance to develop models with certified robustness that can provably guarantee that the prediction cannot be altered by any possible synonymous word substitution. In this work, we propose a certified robust method based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions to the input sentences, and leverages the statistical properties of the ensemble to provably certify the robustness. Our method is simple and structure-free in that it only requires black-box queries of the model outputs, and hence can be applied to any pre-trained model (such as BERT) and any type of model (word-level or subword-level). Our method significantly outperforms recent state-of-the-art methods for certified robustness on both the IMDB and Amazon text classification tasks. To the best of our knowledge, ours is the first work to achieve certified robustness on large systems such as BERT with practically meaningful certified accuracy.


1 Introduction

Deep neural networks have achieved state-of-the-art results in many NLP tasks, but they have also been shown to be brittle to carefully crafted adversarial perturbations, such as replacing words with similar words (Alzantot et al., 2018), adding extra text (Wallace et al., 2019), and replacing sentences with semantically similar sentences (Ribeiro et al., 2018). These adversarial perturbations are imperceptible to humans, but can fool deep neural networks and severely degrade their performance. Efficient methods for defending against these attacks are of critical importance for deploying modern deep NLP models in practical automatic AI systems.

In this paper, we focus on defending against the synonymous word substitution attack (Alzantot et al., 2018), in which an attacker attempts to alter the output of the model by replacing words in the input sentence with their synonyms according to a synonym table, while keeping the meaning of the sentence unchanged. A model is said to be certified robust if such an attack is guaranteed to fail, no matter how the attacker manipulates the input sentences. Achieving and verifying certified robustness is highly challenging even if the synonym table used by the attacker is known during training (see Jia et al., 2019), because it requires checking every possible synonymous word substitution, whose number is exponentially large.

Various defense methods against synonymous word substitution attacks have been developed (e.g., Wallace et al., 2019; Ebrahimi et al., 2018), most of which, however, are not certified robust in that they may eventually be broken by stronger attackers. Recently, Jia et al. (2019) and Huang et al. (2019) proposed the first certified robust methods against word substitution attacks. Their methods are based on interval bound propagation (IBP) (Dvijotham et al., 2018), which computes the range of the model output by propagating interval constraints over the inputs layer by layer.

However, the IBP-based methods of Jia et al. (2019) and Huang et al. (2019) are limited in several ways. First, because IBP only works for certifying neural networks with continuous inputs, the inputs in Jia et al. (2019) and Huang et al. (2019) are taken to be the word embedding vectors of the input sentences, instead of the discrete sentences. This makes their methods inapplicable to character-level (Zhang et al., 2015) and subword-level (Bojanowski et al., 2017) models, which are more widely used in practice (Wu et al., 2016).

In this paper, we propose a structure-free certified defense method that applies to arbitrary models that can be queried in a black-box fashion, without any requirement on the model structures. Our method is based on the idea of randomized smoothing, which smooths the model with random word substitutions built on the synonym network, and leverages the statistical properties of the randomized ensemble to construct provable certification bounds. Similar ideas of provable certification using randomized smoothing have been developed recently in deep learning (e.g., Cohen et al., 2019; Salman et al., 2019; Zhang et al., 2020; Lee et al., 2019), but mainly for computer vision tasks whose inputs (images) lie in a continuous space (Cohen et al., 2019). Our method admits a substantial extension of the randomized smoothing technique to discrete and structured input spaces for NLP.

We test our method on various types of NLP models, including text CNN (Kim, 2014), Char-CNN (Zhang et al., 2015), and BERT (Devlin et al., 2019). Our method significantly outperforms the recent IBP-based methods (Jia et al., 2019; Huang et al., 2019) on both IMDB and Amazon text classification. In particular, we achieve an 87.35% certified accuracy on IMDB by applying our method to the state-of-the-art BERT model, to which previous certified robust methods are not applicable.

2 Adversarial Word Substitution

In a text classification task, a model f maps an input sentence X to a label y in a set Y of discrete categories, where X = (x_1, …, x_L) is a sentence consisting of L words. In this paper, we focus on adversarial word substitution, in which an attacker arbitrarily replaces words in the sentence by their synonyms according to a synonym table in order to alter the prediction of the model. Specifically, for any word x, we consider a pre-defined synonym set S_x that contains the synonyms of x (including x itself). We assume the synonymous relation is symmetric, that is, x is in the synonym set of all its synonyms. The synonym sets can be built based on GLOVE vectors (Pennington et al., 2014).

With a given input sentence X = (x_1, …, x_L), the attacker may construct an adversarial sentence X' = (x'_1, …, x'_L) by perturbing at most T words in X to any of their synonyms x'_i ∈ S_{x_i}:

    S_X := { X' : Σ_{i=1}^{L} 1{x'_i ≠ x_i} ≤ T,  x'_i ∈ S_{x_i} },

where S_X denotes the candidate set of adversarial sentences available to the attacker. Here Σ_i 1{x'_i ≠ x_i} is the Hamming distance between X and X', with 1{·} the indicator function. It is expected that all X' ∈ S_X have the same semantic meaning as X for human readers, but they may yield different outputs from the model. The goal of the attacker is to find X' ∈ S_X such that f(X') ≠ f(X).

Certified Robustness

Formally, a model f is said to be certified robust against word substitution attacks on an input X if it gives consistently correct predictions for all the possible word substitution perturbations, i.e.,

    f(X') = y,     for all X' ∈ S_X,     (1)

where y denotes the true label of sentence X. Deciding if f is certified robust can be highly challenging, because, unless additional structural information is available, it requires examining all the candidate sentences in S_X, whose size grows exponentially with T. In this work, we mainly consider the case T = L, in which every word in the sentence may be perturbed; this is the most challenging case.
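To make the exponential growth concrete, the following Python sketch (not from the paper; the synonym table is a hypothetical toy example) counts the candidate sentences: |S_X| is the product of the per-word synonym-set sizes when all words may be perturbed.

```python
def candidate_count(sentence, synonyms):
    """Number of sentences reachable by synonymous word substitution
    when every word may be replaced (T = L)."""
    count = 1
    for word in sentence:
        # The synonym set always contains the word itself.
        count *= len(synonyms.get(word, set()) | {word})
    return count

# A toy synonym table (hypothetical entries, for illustration only).
synonyms = {
    "good": {"good", "great", "fine"},
    "movie": {"movie", "film"},
}
sentence = ["a", "good", "movie"]
```

With only two substitutable words this already yields 3 × 2 = 6 candidates; for a realistic sentence the count is astronomically large, which is why exhaustive checking is infeasible.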

3 Certifying Smoothed Classifiers

Our idea is to replace f with a more smoothed model g that is easier to verify, obtained by averaging the outputs of f over a set of randomly perturbed inputs based on random word substitutions. The smoothed classifier g is constructed by introducing random perturbations on the input space:

    g(X) = argmax_{c ∈ Y} E_{Z ∼ Π_X} [ 1{f(Z) = c} ],

where Π_X is a probability distribution on the input space that prescribes a random perturbation around X. For notation, we define

    g_c(X) := E_{Z ∼ Π_X} [ 1{f(Z) = c} ],

which is the “soft score” of class c under g.

The perturbation distribution Π_X should be chosen properly so that g forms a close approximation to the original model f (i.e., g(X) ≈ f(X)), and is also sufficiently random to ensure that g is smooth enough to allow certified robustness (in the sense of Theorem 1 below).

In our work, we define Π_X to be the uniform distribution on a set of random word substitutions. Specifically, let P_x be a perturbation set for word x in the vocabulary, which is in general different from the synonym set S_x. In this work, we construct P_x based on the top-K nearest neighbors of x under the cosine similarity of GLOVE vectors, where K is a hyperparameter that controls the size of the perturbation set; see Section 4 for more discussion on P_x.

For a sentence X = (x_1, …, x_L), the sentence-level perturbation distribution Π_X is defined by randomly and independently perturbing each word x_i to a word in its perturbation set P_{x_i} with equal probability, that is,

    Π_X(Z) = ∏_{i=1}^{L} 1{z_i ∈ P_{x_i}} / |P_{x_i}|,

where Z = (z_1, …, z_L) is the perturbed sentence and |P_{x_i}| denotes the size of P_{x_i}. Note that the random perturbation Z and the adversarial candidate X' ∈ S_X are different.
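The sampling step Z ∼ Π_X can be sketched in a few lines of Python (illustrative only, not the authors' code; the perturbation sets below are hypothetical toy values, and the real construction is described in Section 4):

```python
import random

def sample_perturbation(sentence, perturb_sets, rng=random):
    """Draw Z ~ Pi_X: every word is independently replaced by a word
    drawn uniformly from its perturbation set P_x (a word without a
    perturbation set is left unchanged)."""
    return [rng.choice(sorted(perturb_sets.get(w, {w}))) for w in sentence]

# Hypothetical perturbation sets, for illustration only.
perturb_sets = {
    "good": {"good", "great", "fine"},
    "movie": {"movie", "film"},
}
z = sample_perturbation(["a", "good", "movie"], perturb_sets)
```

Repeating this draw and feeding each Z to the black-box classifier f is all the smoothed model requires, which is what makes the approach structure-free.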


Figure 1: A pipeline of the proposed robustness certification approach.

3.1 Certified Robustness

We now discuss how to certify the robustness of the smoothed model g. Recall that g is certified robust at X if g(X') = y for any X' ∈ S_X, where y is the true label. A sufficient condition for this is

    min_{X'∈S_X} g_y(X') > max_{c≠y} max_{X'∈S_X} g_c(X'),

that is, the lower bound of g_y on S_X is larger than the upper bound of g_c on S_X for every c ≠ y. The key step is hence to calculate the upper and lower bounds of g_c(X') for X' ∈ S_X and c ∈ Y, which we address in Theorem 1 below. All proofs are in Appendix A.2.

Theorem 1.

(Certified Lower/Upper Bounds) Assume the perturbation sets are constructed such that |P_x| = |P_{x'}| for every word x and its synonym x' ∈ S_x. Define

    q_x := min_{x'∈S_x} |P_x ∩ P_{x'}| / |P_x|,

where q_x indicates the overlap between the two different perturbation sets. For a given sentence X = (x_1, …, x_L), we sort the words according to q_{x_i} in ascending order, such that q_{x_{[1]}} ≤ q_{x_{[2]}} ≤ … ≤ q_{x_{[L]}}. Then

    min_{X'∈S_X} g_c(X') ≥ g_c(X) − q_X    and    max_{X'∈S_X} g_c(X') ≤ g_c(X) + q_X,

where q_X := 1 − ∏_{i=1}^{T} q_{x_{[i]}}. Equivalently, this says

    |g_c(X') − g_c(X)| ≤ q_X    for all X' ∈ S_X.

The idea is that, with the randomized smoothing, the difference between g_c(X) and g_c(X') is at most q_X for any adversarial candidate X' ∈ S_X. Therefore, we can obtain adversarial upper and lower bounds of g_c(X') from g_c(X) ± q_X, which, importantly, avoids the difficult adversarial optimization of g_c(X') over S_X, and instead only needs to evaluate g_c at the original input X.
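A minimal sketch of how the per-word overlap scores and the resulting certified gap could be computed (illustrative Python, not the authors' implementation; function names are ours):

```python
def word_overlap_score(word, synonym_set, perturb_sets):
    """q_x = min over synonyms x' of |P_x ∩ P_x'| / |P_x|."""
    p_x = perturb_sets[word]
    return min(len(p_x & perturb_sets[s]) / len(p_x) for s in synonym_set)

def certified_gap(q_values, budget):
    """q_X = 1 - product of the `budget` smallest per-word overlaps q_x.
    The smoothed scores then satisfy |g_c(X') - g_c(X)| <= q_X for every
    adversarial candidate X'."""
    prod = 1.0
    for q in sorted(q_values)[:budget]:
        prod *= q
    return 1.0 - prod
```

Note that the gap depends only on the perturbation and synonym sets, not on the model, so it can be precomputed once per sentence.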

We are ready to describe a practical criterion for checking the certified robustness.

Proposition 1.

For a sentence X and its true label y, we define

    Δ_X := g_y(X) − max_{c≠y} g_c(X) − 2 q_X.

Then under the condition of Theorem 1, we can certify that g(X') = y for any X' ∈ S_X if

    Δ_X > 0.

Therefore, certifying whether the model gives consistently correct predictions reduces to checking if Δ_X is positive, which can be easily achieved with Monte Carlo estimation as we show in the sequel.

Estimating g_c(X) and Δ_X

Recall that g_c(X) = E_{Z∼Π_X}[1{f(Z) = c}]. We can estimate g_c(X) with a Monte Carlo estimator ĝ_c(X) = (1/n) Σ_{i=1}^{n} 1{f(Z^(i)) = c}, where Z^(1), …, Z^(n) are i.i.d. samples from Π_X, and Δ_X can be approximated accordingly. Using concentration inequalities, we can quantify the non-asymptotic approximation error. This allows us to construct rigorous statistical procedures to reject the null hypothesis that g is not certified robust at X (i.e., Δ_X ≤ 0) with a given significance level. See Appendix A.1 for the algorithmic details of the testing procedure.

We can see that our procedure is structure-free in that it only requires black-box assessment of the model outputs on the random inputs, and does not require any other structural information of f and g, which makes our method widely applicable to various types of complex models.
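The Monte Carlo estimation and the certification check of Proposition 1 can be sketched as follows (a simplified illustration, not the paper's implementation; the rigorous version additionally accounts for the Monte Carlo error, as detailed in Appendix A.1):

```python
from collections import Counter

def estimate_scores(model, sample_fn, x, n):
    """Monte Carlo estimate of the soft scores g_c(X) = P_{Z~Pi_X}(f(Z)=c):
    `model` is the black-box classifier f, `sample_fn` draws Z ~ Pi_X."""
    counts = Counter(model(sample_fn(x)) for _ in range(n))
    return {c: k / n for c, k in counts.items()}

def certify(scores, true_label, q_gap):
    """Check the sufficient condition of Proposition 1:
    Delta_X = g_y(X) - max_{c != y} g_c(X) - 2 * q_X > 0."""
    runner_up = max((s for c, s in scores.items() if c != true_label),
                    default=0.0)
    return scores.get(true_label, 0.0) - runner_up - 2 * q_gap > 0
```

The only access to the model is through `model(...)` calls on perturbed inputs, which is the structure-free property in code form.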


A key question is whether our bounds are sufficiently tight. The next theorem shows that the lower/upper bounds in Theorem 1 are tight and cannot be further improved unless further information of the model f or g is acquired.

Theorem 2.

(Tightness) Assume the conditions of Theorem 1 hold. For a model f with soft scores g_c(X) and margin Δ_X as defined in Proposition 1, there exists a model f* such that its related smoothed classifier g* satisfies g*_c(X) = g_c(X) for all c ∈ Y, and

    min_{X'∈S_X} g*_c(X') = max(0, g_c(X) − q_X),    max_{X'∈S_X} g*_c(X') = min(1, g_c(X) + q_X),

where q_X is defined in Theorem 1.

In other words, if we access f only through the evaluations of the scores g_c(X), then the bounds in Theorem 1 are the tightest possible that we can achieve, because we cannot distinguish between f and the f* in Theorem 2 with the information available.

3.2 Practical Algorithm

Figure 1 visualizes the pipeline of the proposed approach. Given the synonym sets S_x, we generate the perturbation sets P_x from them. When an input sentence X arrives, we draw perturbed sentences from Π_X and average the corresponding model outputs to estimate the scores g_c(X), which are used to decide if the model is certified robust for X.

Training the Base Classifier

Our method needs to start with a base classifier f. Although it is possible to train f using standard learning techniques, the result can be improved by taking into account that the method uses the smoothed model g, instead of f. To improve the accuracy of g, we introduce a data augmentation technique induced by the perturbation sets. Specifically, at each training iteration, we first sample a mini-batch of data points (sentences) and randomly perturb the sentences using the perturbation distribution Π_X. We then apply gradient descent on the model based on the perturbed mini-batch. Similar training procedures were also used for Gaussian-based randomized smoothing on continuous inputs (see e.g., Cohen et al., 2019).
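The data-augmented training loop could be sketched as follows (framework-agnostic Python, an illustration rather than the authors' code; `model_step` is a hypothetical stand-in for one optimizer step on the base classifier f):

```python
import random

def augment_batch(batch, perturb_sets, rng=random):
    """Perturb each sentence of a mini-batch with the smoothing
    distribution Pi_X: every word is independently replaced by a
    uniform draw from its perturbation set."""
    return [[rng.choice(sorted(perturb_sets.get(w, {w}))) for w in sent]
            for sent in batch]

def train_epoch(model_step, batches, perturb_sets, rng=random):
    """One pass of data-augmented training: each mini-batch is randomly
    perturbed before the gradient update, so the base classifier f is
    trained on the same input distribution that the smoothed model g
    sees at certification time."""
    for batch, labels in batches:
        model_step(augment_batch(batch, perturb_sets, rng), labels)
```

The key design point is that the augmentation reuses the certification-time distribution Π_X, aligning the training and smoothing distributions.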

Our method can easily leverage powerful pre-trained models such as BERT. In this case, BERT is used to construct the feature maps, and only the top-layer weights are fine-tuned using the data augmentation method.

4 Experiments

We test our method on both the IMDB (Maas et al., 2011) and Amazon (McAuley, 2013) text classification tasks, with various types of models, including text CNN (Kim, 2014), Char-CNN (Zhang et al., 2015), and BERT (Devlin et al., 2019). We compare with the recent IBP-based methods (Jia et al., 2019; Huang et al., 2019) as baselines. Text CNN (Kim, 2014) was used in Jia et al. (2019) and achieves the best result therein. All the baseline models are trained and tuned using the schedules recommended in the corresponding papers. We consider the case T = L during attacking, which means all words in the sentence can be perturbed simultaneously by the attacker. Code for reproducing our results is publicly available.

Synonym Sets

Similar to Jia et al. (2019) and Alzantot et al. (2018), we construct the synonym set S_x of a word x as the set of words whose cosine similarity to x in the GLOVE vector space exceeds a fixed threshold. The word vector space is constructed by post-processing the pre-trained GLOVE vectors (Pennington et al., 2014) using the counter-fitting method (Mrkšić et al., 2016) and the “all-but-the-top” method (Mu and Viswanath, 2018) to ensure that synonyms are near to each other while antonyms are far apart.

Perturbation Sets

We say that two words x and x' are connected synonymously if there exists a path of words x = x^(1), x^(2), …, x^(m) = x' such that all successive pairs are synonymous. Let B_x be the set of words connected to x synonymously. We then define the perturbation set P_x to consist of the top-K words in B_x with the largest GLOVE cosine similarity to x if |B_x| > K, and set P_x = B_x if |B_x| ≤ K. Here K is a hyperparameter that controls the size of P_x and hence trades off the smoothness and accuracy of g. We use K = 100 by default and investigate its effect in Section 4.2.
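The two-stage construction — connected synonym component B_x, then top-K by similarity — could be sketched as (illustrative Python, not the authors' code; `similarity` is a stand-in for cosine similarity in the GLOVE vector space):

```python
from collections import deque

def connected_words(word, synonyms):
    """B_x: all words reachable from `word` through chains of synonyms
    (breadth-first search over the synonym graph)."""
    seen, queue = {word}, deque([word])
    while queue:
        w = queue.popleft()
        for s in synonyms.get(w, set()):
            if s not in seen:
                seen.add(s)
                queue.append(s)
    return seen

def perturbation_set(word, synonyms, similarity, k):
    """P_x: the top-K words of B_x by similarity to x when |B_x| > K,
    and all of B_x otherwise."""
    b_x = connected_words(word, synonyms)
    if len(b_x) <= k:
        return b_x
    return set(sorted(b_x, key=lambda w: similarity(word, w),
                      reverse=True)[:k])
```

Because all words of one connected component share the same B_x, the size |P_x| = min(K, |B_x|) agrees for a word and its synonyms, which is the condition needed by Theorem 1.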

Evaluation Metric

We evaluate the certified robustness of a model on a dataset with the certified accuracy (Cohen et al., 2019), which equals the percentage of data points on which the model is certified robust; for our method, this holds when Δ_X > 0 can be verified.

4.1 Main Results

We first demonstrate that adversarial word substitution provides a strong attack in our experimental setting. Using the IMDB dataset, we attack the vanilla BERT (Devlin et al., 2019) with the adversarial attack method of Jin et al. (2020). The vanilla BERT achieves a high clean accuracy (the testing accuracy on clean data without attack), but a much lower adversarial accuracy (the testing accuracy under the particular attack by Jin et al. (2020)). We will show later that our method achieves an 87.35% certified accuracy with BERT, and the corresponding adversarial accuracy under any synonymous word substitution attack must therefore be at least as high.

We compare our method with the IBP-based methods (Jia et al., 2019; Huang et al., 2019) in Table 1. We can see that our method clearly outperforms the baselines. In particular, our approach significantly outperforms IBP on Amazon, improving the best baseline from 14.00% to 24.92% certified accuracy.

Thanks to its structure-free property, our algorithm can be easily applied to any pre-trained models and character-level models, which is not easily achievable with Jia et al. (2019) and Huang et al. (2019). Table 2 shows that our method can further improve the results using Char-CNN (a character-level model) and BERT (Devlin et al., 2019), achieving an 87.35% certified accuracy on IMDB. In comparison, the best IBP baseline only achieves a 79.74% certified accuracy under the same setting.

Method               IMDB    Amazon
Jia et al. (2019)    79.74   14.00
Huang et al. (2019)  78.74   12.36
Ours                 81.16   24.92

Table 1: The certified accuracy of our method and the baselines on the IMDB and Amazon dataset.

Method               Model     Accuracy
Jia et al. (2019)    CNN       79.74
Huang et al. (2019)  CNN       78.74
Ours                 CNN       81.16
Ours                 Char-CNN  82.03
Ours                 BERT      87.35

Table 2: The certified accuracy of different models and methods on the IMDB dataset.

4.2 Trade-Off between Clean Accuracy and Certified Accuracy

We investigate the trade-off between smoothness and accuracy by varying K in Table 3. We can see that the clean accuracy decreases as K increases, while the gap between the clean accuracy and the certified accuracy, which measures the smoothness, also decreases as K increases. The best certified accuracy is achieved when K = 100.

K              20     50     100    250    1000
Clean (%)      88.47  88.48  88.09  84.83  67.54
Certified (%)  65.58  77.32  81.16  79.98  65.13
Table 3: Results of the smoothed model with different K on IMDB using text CNN. “Clean” represents the accuracy on the clean data without adversarial attack and “Certified” the certified accuracy.

5 Conclusion

We proposed a robustness certification method that provably guarantees that no admissible synonymous word substitution can alter the prediction of the system. Compared with previous work such as Jia et al. (2019) and Huang et al. (2019), our method is structure-free and thus can be easily applied to any pre-trained models (such as BERT) and character-level models (such as Char-CNN).

The construction of the perturbation sets is of critical importance to our method. In this paper, we used a heuristic based on the synonym network to construct the perturbation sets, which may not be optimal. In future work, we will explore more efficient ways of constructing the perturbation sets. We also plan to generalize our approach to achieve certified robustness against other types of adversarial attacks in NLP, such as the out-of-list attack. A naïve way is to add the “OOV” token into the synonym set of every word, but potentially better procedures can be explored.


This work is supported in part by NSF CRII 1830161 and NSF CAREER 1846421.


  • M. Alzantot, Y. Sharma, A. Elgohary, B. Ho, M. Srivastava, and K. Chang (2018) Generating natural language adversarial examples. In EMNLP, Cited by: §1, §1, §4.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. In TACL, Cited by: §1.
  • J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In ICML, Cited by: Appendix B, §1, §3.2, §4.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §1, §4.1, §4.1, §4.
  • K. Dvijotham, S. Gowal, R. Stanforth, R. Arandjelovic, B. O’Donoghue, J. Uesato, and P. Kohli (2018) Training verified learners with learned verifiers. arXiv preprint arXiv:1805.10265. Cited by: §1.
  • J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2018) Hotflip: white-box adversarial examples for text classification. In ACL, Cited by: §1.
  • P. Huang, R. Stanforth, J. Welbl, C. Dyer, D. Yogatama, S. Gowal, K. Dvijotham, and P. Kohli (2019) Achieving verified robustness to symbol substitutions via interval bound propagation. In EMNLP, Cited by: §1, §1, §1, §4.1, §4.1, Table 1, Table 2, §4, §5.
  • R. Jia, A. Raghunathan, K. Göksel, and P. Liang (2019) Certified robustness to adversarial word substitutions. In EMNLP, Cited by: Appendix B, §1, §1, §1, §1, §4, §4.1, §4.1, Table 1, Table 2, §4, §5.
  • D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits (2020) Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI, Cited by: §4.1.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In EMNLP, Cited by: §1, §4.
  • G. Lee, Y. Yuan, S. Chang, and T. Jaakkola (2019) Tight certificates of adversarial robustness for randomly smoothed classifiers. In NeurIPS, Cited by: §1.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In ACL, Cited by: Appendix B, §4.
  • J. McAuley (2013) Hidden factors and hidden topics: understanding rating dimensions with review text. In ACM RecSys, Cited by: Appendix B, §4.
  • N. Mrkšić, D. O. Séaghdha, B. Thomson, M. Gašić, L. Rojas-Barahona, P. Su, D. Vandyke, T. Wen, and S. Young (2016) Counter-fitting word vectors to linguistic constraints. In NAACL, Cited by: §4.
  • J. Mu and P. Viswanath (2018) All-but-the-top: simple and effective postprocessing for word representations. In ICLR, Cited by: §4.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In EMNLP, Cited by: §2, §4.
  • M. T. Ribeiro, S. Singh, and C. Guestrin (2018) Semantically equivalent adversarial rules for debugging nlp models. In ACL, Cited by: §1.
  • H. Salman, J. Li, I. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, and G. Yang (2019) Provably robust deep learning via adversarially trained smoothed classifiers. In NeurIPS, Cited by: §1.
  • E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh (2019) Universal adversarial triggers for nlp. In EMNLP, Cited by: §1, §1.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1.
  • D. Zhang, M. Ye, C. Gong, Z. Zhu, and Q. Liu (2020) Black-box certification with randomized smoothing: a functional optimization based framework. arXiv preprint arXiv:2002.09169. Cited by: §1.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In NIPS, Cited by: §1, §1, §4.

Appendix A Appendix

A.1 Bounding the Error of Monte Carlo Estimation

As shown in Proposition 1, the smoothed model g is certified robust at an input X in the sense of (1) if

    Δ_X = g_y(X) − max_{c≠y} g_c(X) − 2 q_X > 0,

where y is the true label of X, and g_c(X) = E_{Z∼Π_X}[1{f(Z) = c}].

Assume {Z^(i)}_{i=1}^{n} is an i.i.d. sample from Π_X. By Monte Carlo approximation, we can estimate g_c(X) for all c ∈ Y jointly, via

    ĝ_c(X) = (1/n) Σ_{i=1}^{n} 1{f(Z^(i)) = c},

and estimate Δ_X via

    Δ̂_X = ĝ_y(X) − max_{c≠y} ĝ_c(X) − 2 q_X.

To develop a rigorous procedure for testing Δ_X > 0, we need to bound the non-asymptotic error of the Monte Carlo estimation, which can be done with a simple application of Hoeffding’s concentration inequality and a union bound.

Proposition 2.

Assume {Z^(i)}_{i=1}^{n} is i.i.d. drawn from Π_X. For any α ∈ (0, 1), with probability at least 1 − α, we have

    max_{c∈Y} |ĝ_c(X) − g_c(X)| ≤ sqrt( log(2|Y|/α) / (2n) ).
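A small numeric sketch of the resulting testing criterion, assuming a standard Hoeffding-plus-union-bound margin (an illustration; the paper's exact sample size and significance level are not reproduced here):

```python
import math

def hoeffding_margin(n, num_classes, alpha):
    """Simultaneous two-sided Hoeffding/union bound: with probability at
    least 1 - alpha, every Monte Carlo class score is within this margin
    of its true value (assumed form of the concentration bound)."""
    return math.sqrt(math.log(2 * num_classes / alpha) / (2 * n))

def reject_non_robust(delta_hat, n, num_classes, alpha):
    """Reject the null 'g is not certified robust at X' when the
    estimated margin Delta_hat exceeds twice the per-score error bound
    (the estimate of Delta combines two scores, hence the factor 2)."""
    return delta_hat > 2 * hoeffding_margin(n, num_classes, alpha)
```

For instance, with a few thousand samples and a small α the margin is on the order of a few percent, so only inputs with a comfortably positive estimated Δ̂_X are certified.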

We can now frame the robustness certification problem as a hypothesis testing problem. Consider the null hypothesis H0 and the alternative hypothesis H1:

    H0: Δ_X ≤ 0    versus    H1: Δ_X > 0.

Then according to Proposition 2, we can reject the null hypothesis with a significance level α if

    Δ̂_X > 2 sqrt( log(2|Y|/α) / (2n) ).

In all the experiments, we use the same fixed sample size n and significance level α for every data point.

A.2 Proof of the Main Theorems

In this section, we give the proofs of the theorems in the main text.

A.2.1 Proof of Proposition 1

According to the definition of g, it is certified robust at X, that is, g(X') = y for all X' ∈ S_X, if

    min_{X'∈S_X} g_y(X') > max_{c≠y} max_{X'∈S_X} g_c(X').    (3)

By Theorem 1, we have min_{X'∈S_X} g_y(X') ≥ g_y(X) − q_X and max_{X'∈S_X} g_c(X') ≤ g_c(X) + q_X for every c ≠ y. Therefore, Δ_X > 0 must imply (3) and hence certified robustness.

A.2.2 Proof of Theorem 1

Our goal is to calculate the upper and lower bounds max_{X'∈S_X} g_c(X') and min_{X'∈S_X} g_c(X'). Our key idea is to frame the computation of the upper and lower bounds as a variational optimization.

Lemma 1.

Define F to be the set of all bounded functions mapping from the input space to [0, 1]. For any X' ∈ S_X, define

    g_c^low(X') := min_{h∈F} { E_{Z∼Π_{X'}}[h(Z)] : E_{Z∼Π_X}[h(Z)] = g_c(X) },
    g_c^up(X')  := max_{h∈F} { E_{Z∼Π_{X'}}[h(Z)] : E_{Z∼Π_X}[h(Z)] = g_c(X) }.

Then we have for any X and X' ∈ S_X,

    g_c^low(X') ≤ g_c(X') ≤ g_c^up(X').

Proof of Lemma 1.

The proof is straightforward. Define h*(Z) := 1{f(Z) = c}. Recall that

    g_c(X) = E_{Z∼Π_X}[h*(Z)]    and    g_c(X') = E_{Z∼Π_{X'}}[h*(Z)].

Therefore, h* satisfies the constraint in the optimization, which makes it obvious that

    g_c^low(X') ≤ E_{Z∼Π_{X'}}[h*(Z)] = g_c(X').

Taking the minimum over X' ∈ S_X on both sides yields the lower bound. The upper bound follows the same derivation. ∎

Therefore, the problem reduces to deriving bounds for the optimization problems.

Theorem 3.

Under the assumptions of Theorem 1, for the optimization problems in Lemma 1, we have

    min_{X'∈S_X} g_c^low(X') ≥ g_c(X) − q_X,    max_{X'∈S_X} g_c^up(X') ≤ g_c(X) + q_X,

where q_X is the quantity defined in Theorem 1 in the main text.

Now we proceed to prove Theorem 3.

Proof of Theorem 3.

We only consider the minimization problem because the maximization follows the same proof. For notation, we denote q := g_c(X). Applying the Lagrange multiplier to the constrained optimization problem and exchanging the min and max, we have

    min_{X'∈S_X} g_c^low(X') = min_{X'∈S_X} max_{λ≥0} { λ q + Φ_{X'}(λ) },    where    Φ_{X'}(λ) := min_{h∈F} ( E_{Π_{X'}}[h] − λ E_{Π_X}[h] ).

Here the expectations are sums with respect to the counting measure on the discrete input space. Now we calculate Φ_{X'}(λ).

Lemma 2.

Given λ ≥ 0 and X' ∈ S_X, define Φ_{X'}(λ) := min_{h∈F} ( E_{Π_{X'}}[h] − λ E_{Π_X}[h] ), and let v_{X'} := ∏_{i=1}^{L} |P_{x_i} ∩ P_{x'_i}| / |P_{x_i}| denote the probability of the overlap of the two perturbation distributions. We have the following identity:

    Φ_{X'}(λ) = − Σ_Z max(0, λ Π_X(Z) − Π_{X'}(Z)).

As a result, under the assumption that |P_x| = |P_{x'}| for every word x and its synonym x', we have

    Φ_{X'}(λ) = − max(0, λ − 1) v_{X'} − λ (1 − v_{X'}).

We now need to solve the optimization of Φ_{X'}(λ) over the adversarial sentence X' ∈ S_X.

Lemma 3.

For any word x, define q_x := min_{x'∈S_x} |P_x ∩ P_{x'}| / |P_x|. For a given sentence X = (x_1, …, x_L), we define an ordering [1], …, [L] of the words such that q_{x_{[i]}} ≤ q_{x_{[j]}} for any i ≤ j. For a given X and budget T, we define an adversarially perturbed sentence X* = (x*_1, …, x*_L) where

    x*_{[i]} = argmin_{x'∈S_{x_{[i]}}} |P_{x_{[i]}} ∩ P_{x'}| / |P_{x_{[i]}}|  for i ≤ T,    and    x*_{[i]} = x_{[i]}  for i > T.

Then for any λ ≥ 0, we have that X* is the optimal solution of min_{X'∈S_X} Φ_{X'}(λ), that is,

    Φ_{X*}(λ) = min_{X'∈S_X} Φ_{X'}(λ).

Now by Lemma 3, the lower bound becomes

    min_{X'∈S_X} g_c^low(X') = max_{λ≥0} { λ q − max(0, λ − 1) (1 − q_X) − λ q_X }    (4)
                             = max(0, q − q_X),

where q_X is consistent with the definition in Theorem 1:

    q_X = 1 − ∏_{i=1}^{T} q_{x_{[i]}}.

Here equation (4) is by calculation using the assumption of Theorem 1, noting that v_{X*} = ∏_{i=1}^{T} q_{x_{[i]}} = 1 − q_X. The optimization of λ in (4) is an elementary step: if q ≥ q_X, the maximum is attained at λ = 1 with value q − q_X; if q < q_X, it is attained at λ = 0 with value 0. This finishes the proof of the lower bound. The proof of the upper bound follows similarly. ∎

Proof of Lemma 2

Notice that we have

    Φ_{X'}(λ) = min_{h∈F} Σ_Z h(Z) ( Π_{X'}(Z) − λ Π_X(Z) ) = Σ_Z min(0, Π_{X'}(Z) − λ Π_X(Z)) = − Σ_Z max(0, λ Π_X(Z) − Π_{X'}(Z)),

because the minimum over h ∈ F is attained by setting h(Z) = 1 whenever Π_{X'}(Z) − λ Π_X(Z) < 0 and h(Z) = 0 otherwise. This proves the identity.

Also notice that, when |P_x| = |P_{x'}| for every word and its synonym, Π_X(Z) = Π_{X'}(Z) = 1 / ∏_{i=1}^{L} |P_{x_i}| for every Z in the overlap of the two supports; Π_X(Z) = 1 / ∏_{i=1}^{L} |P_{x_i}| and Π_{X'}(Z) = 0 for Z in the support of Π_X outside the overlap; and the overlap carries probability v_{X'} = ∏_{i=1}^{L} |P_{x_i} ∩ P_{x'_i}| / |P_{x_i}| under both Π_X and Π_{X'}. Plugging in the above values, for λ ≥ 0 we have

    Σ_Z max(0, λ Π_X(Z) − Π_{X'}(Z)) = max(0, λ − 1) v_{X'} + λ (1 − v_{X'}),

and therefore

    Φ_{X'}(λ) = − max(0, λ − 1) v_{X'} − λ (1 − v_{X'}). ∎