1 Introduction
Due to their superior performance, deep neural networks (DNNs) have been deployed on real systems in many fields, such as image recognition [11]
and natural language processing
[38]. Real-world systems may take inputs shifted by various perturbations, e.g., different lighting effects on an image or various ambient noise in a conversation, which can cause unreliable predictions from DNNs. In particular, crafted adversarial examples [26] can easily flip the predictions of deployed DNNs by adding imperceptible noise to natural data. This arouses anxiety about deploying DNNs in safety-critical fields, such as autonomous driving [7] and medical image analysis [16]. Recently, many efforts have been made on learning robust DNNs to resist such adversarial examples. In general, there are two broad branches in adversarial machine learning, i.e., certified robust training
[35, 30, 8, 14] and empirical robust training [17, 36, 33]. Their common purpose is to construct robust DNNs that mimic a naturally occurring system (e.g., the human visual system). Such a system is believed to be robust and invariant to adversarial perturbations because its output is smooth w.r.t. its input [32]. To acquire such smoothness, we can conduct data augmentation using adversarial data [17, 33] or data perturbed with Gaussian random noise [16, 8]
during training. In this way, predictions of DNNs around a data input become insensitive to imperceptible perturbations. Nonetheless, Tsipras et al. [29] elucidated that adversarial robustness may be at odds with standard accuracy. To mitigate the large gap between robustness and accuracy, more well-labeled samples are needed during training [25], which also achieves greater smoothness, closer to that of a naturally occurring system. However, it is expensive to gather well-labeled data, not to mention the large-scale well-labeled data required for smoothness [25]. Fortunately, this issue can be alleviated through seminal efforts [20, 5, 31] that utilize unlabeled data to improve adversarial robustness. Conceptually, the above works consist of three components:
(a) They gathered a large amount of unlabeled data in addition to the existing labeled training data.
(b) Based on the existing labeled training data, they annotated the unlabeled data with pseudo labels. The goal is to minimize the divergence (e.g., the KL divergence) between the model's prediction on each unlabeled point and its pseudo label.
(c) They conducted robust training on the labeled data together with the pseudo-labeled data.
For example, UAT++ in [31] and SSDRL in [20] integrate parts (b) and (c) into the objective functions of DNNs, which encourages the model output on unlabeled data to be close to the unknown ground-truth labels when the number of labeled training samples is large. However, this integration limits the diversity of annotation methods (i.e., part (b)): it only enables regularization-based methods (e.g., VAT [19]) to annotate extra unlabeled data. Many potential methods are excluded, such as co-training [4, 23] and graph-based models [39]. By contrast, robust self-training (RST) in [5] has three independent modules for parts (a)–(c), and each fungible part has its clear purpose. Thus, it is believed to be the best by the standard of modular design [34].
The purpose of part (a) is to gather as much qualified unlabeled data as possible, e.g., to scrape websites for unlabeled images or to collect medical images without a doctor's diagnosis. There are many standard methods to gather such data, so improving this part is out of the scope of the current paper. Meanwhile, different training methods in part (c) seem to have hit their limits, and can hardly narrow the gap between robust generalization and standard generalization further [25]. For example, on CIFAR-10, the state-of-the-art TRADES achieves robust accuracy around 50% [36], while standard accuracy can be above 90% [11, 37].
Thus, this motivates us to improve part (b), namely including more well-labeled data. Label quality is crucial to boosting adversarial robustness. For example, in Figure 1, given a fixed amount of unlabeled data, increased label accuracy boosts both standard accuracy and robust accuracy significantly. As the amount of unlabeled data increases, high label accuracy has an increasingly positive effect on adversarial robustness; conversely, the negative effects of low-quality labels are reinforced during training. RST [5] first learns a classifier based merely on the labeled dataset, then uses the learned classifier to annotate all unlabeled data with pseudo labels. We name this classifier the predetermined annotator. The predetermined annotator does not consider the knowledge of the unlabeled data. As a simple example in Figure 2 illustrates, an annotator based solely on the labeled data may give wrong labels to a large portion of the unlabeled data. RST thus has a bottleneck: it can give poor pseudo labels to unlabeled data, and the later adversarial training is then fed many erroneous examples. Even worse, this error is accumulated and reinforced over training. The quality of pseudo labels decides the success of adversarial training.

Fortunately, there remains a lot of room to improve the quality of these labels. To break the bottleneck of RST [5], we leverage deep co-training to improve the quality of pseudo labels in part (b), and thus propose robust co-training (RCT) for adversarial learning with unlabeled data. The proposed algorithm utilizes two networks that correct each other's mistakes by reaching consensus on unlabeled data. Meanwhile, each network robustly trains on adversarial examples generated by its peer network, which keeps the two networks diverged in function. Our experiments confirm its effectiveness on the quality of pseudo labels, which further boosts both standard test accuracy and robust test accuracy in adversarial training. Our proposed method takes a giant leap towards closing the gap between adversarially robust generalization and standard generalization.
2 Related Work
2.1 Semi-supervised Deep Learning
Many works have been proposed to boost the label quality of unlabeled data, largely located in the area of semi-supervised learning (SSL). Self-training
[18, 15] is one of the simplest approaches in SSL. Self-training produces pseudo labels for unlabeled data using the model itself to obtain additional training data; unlabeled data with confident predictions are recruited into training. However, self-training is hardly able to correct its own mistakes. If the model's prediction on an unlabeled example is confident but wrong, the wrongly pseudo-labeled example is permanently incorporated into training, and it amplifies the model's error over training iterations.

Multi-view training aims to train multiple models with different views of the data. These views complement each other and can help to correct one another's mistakes. The most representative example is co-training [4]. To be specific, in [4] different views refer to different independent sets of features of the same data. For example, in web page classification, one feature set is the text on the web page, and another is the anchor text of hyperlinks pointing to that page. Two models look at the different feature sets, and each model is trained on its respective set. Over training iterations, unlabeled data with confident predictions by one model are moved into the training set of its peer model.
Regularization-based semi-supervised learning encourages the outputs of different perturbations of the input data to be close, by adding a regularization term to the loss function. For example, [2, 13, 27] use random perturbations and [19] uses virtual adversarial perturbations. For a comprehensive review of SSL, including generative-model-based SSL and graph-based SSL, see [6].

2.2 Adversarial Defense
Many works focus on building adversarially robust models against adversarial perturbations. In general, they are divided into two branches: certified defenses and empirical defenses.
In certified defenses, the model's prediction is expected to be unchanged for any perturbed data around the corresponding natural data. Exemplar works include [24, 35, 8]. For example, [14, 8] use randomized smoothing to transform a base classifier into a new smoothed classifier. However, due to its strong assumptions, certified robustness has difficulty scaling to large models and high-dimensional data, and suffers from low computational efficiency in its robustness certification.
Another line of defense is empirical defense. Empirical defense dynamically exploits adversarial examples and recruits them into training along with the natural data. Adversarial examples are crafted from natural data: the network has a large loss on them, yet they are visually indistinguishable from their natural counterparts. The most representative methods are Madry's adversarial training [17] and TRADES [36].
In empirical defense, the purpose of defense is to minimize the adversarial risk, i.e.,

$\mathcal{R}_{\mathrm{adv}}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\max_{\tilde{x}\in B_\epsilon[x]}\ell\big(f(\tilde{x};\theta),y\big)\big]$,

where $\mathcal{D}$ denotes the true distribution over samples and $B_\epsilon[x]$ denotes the allowed perturbation region of the sample point $x$. The empirical defense is to find parameters $\theta$ that minimize the empirical risk

$\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\max_{\tilde{x}_i\in B_\epsilon[x_i]}\ell\big(f(\tilde{x}_i;\theta),y_i\big)$,

where $\{(x_i,y_i)\}_{i=1}^{n}$ is a finite set of samples drawn i.i.d. from $\mathcal{D}$. To solve this min-max problem, [17] applies Danskin's theorem [3]. At each training iteration, Madry's adversarial training first exploits adversarial examples that maximize the loss and then updates the classifier based on these adversarial examples, i.e.,

$\tilde{x}_i = \arg\max_{\tilde{x}\in B_\epsilon[x_i]}\ell\big(f(\tilde{x};\theta),y_i\big)$, $\quad \min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\ell\big(f(\tilde{x}_i;\theta),y_i\big)$, (1)

where $\tilde{x}_i$ is the adversarial example of $x_i$ within its allowed perturbation region $B_\epsilon[x_i]$, i.e., $B_\epsilon[x_i]=\{\tilde{x}:\|\tilde{x}-x_i\|_\infty\le\epsilon\}$. The inner maximization is a non-convex optimization problem whose exact solution is difficult to obtain, so projected gradient descent (PGD) [17] is utilized to approximately search for a local maximum. $\ell$ is the cross-entropy loss, encouraging the predicted value of the adversarial example $\tilde{x}_i$ to be near the true label $y_i$ of its corresponding natural example $x_i$.
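Since PGD is the workhorse for the inner maximization, a minimal sketch may help. The following is an illustrative NumPy implementation on a toy quadratic loss with an analytic gradient; the loss function, step sizes, and all parameter values are assumptions for demonstration, not the paper's setup:

```python
import numpy as np

def pgd(x, grad_fn, epsilon, step_size, steps):
    """Approximate the inner maximization by projected gradient ascent
    on the loss inside the L-infinity ball around x.  `grad_fn` returns
    the gradient of the loss w.r.t. the input."""
    x_adv = x.copy()
    for _ in range(steps):
        # signed-gradient ascent step on the loss
        x_adv = x_adv + step_size * np.sign(grad_fn(x_adv))
        # project back onto the epsilon-ball around the natural input
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
    return x_adv

# toy loss l(x') = ||x' - t||^2, whose gradient is 2 (x' - t)
t = np.array([1.0, -1.0])
x = np.zeros(2)
x_adv = pgd(x, lambda z: 2.0 * (z - t), epsilon=0.1, step_size=0.03, steps=10)
```

Each iterate takes a signed-gradient ascent step and is then projected back onto the ball, so the returned point is feasible by construction and (on this toy) has a strictly larger loss than the natural input.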
Another exemplar work is TRADES [36]. Similar to VAT [19], it introduces a regularization term into the loss function, encouraging similarity between the predictions of a natural example and its adversarial example, i.e.,

$\min_{\theta}\frac{1}{n}\sum_{i=1}^{n}\big\{\ell\big(f(x_i;\theta),y_i\big)+\beta\max_{\tilde{x}_i\in B_\epsilon[x_i]}\mathrm{KL}\big(f(x_i;\theta)\,\|\,f(\tilde{x}_i;\theta)\big)\big\}$, (2)

where $\mathrm{KL}$ is the Kullback-Leibler divergence measuring the prediction difference, $\ell$ is the cross-entropy loss, and $\beta$ is the trade-off parameter. It also uses PGD to approximately solve the inner maximization.

3 Methodology
In order to achieve greater smoothness in adversarial training, three seminal works leverage large-scale unlabeled data [20, 5, 31]. Figure 1 shows that, given a fixed amount of unlabeled data, both standard test accuracy and robust test accuracy improve when the accuracy of the pseudo labels improves. Thus, it is essential to acquire high-quality pseudo labels on those unlabeled data, and part (b) plays a vital role in the success of adversarial training.
Although Carmon et al. use the term "robust self-training" (RST) to characterize their algorithm [5], their actual operation for part (b) is not conventional self-training. Specifically, they train a classifier based merely on the labeled dataset. Then, they use the pretrained classifier to annotate all unlabeled data at once, acquiring pseudo labels for the unlabeled data. We name such a pretrained classifier the predetermined annotator. Finally, they jointly use the labeled and pseudo-labeled data to train an adversarially robust deep neural network. However, is the predetermined annotator good enough to annotate the unlabeled dataset? The answer is negative.
Figure 2 shows a simple example to simulate and explain why RST is not an optimal solution. Following RST, the left panel shows the decision boundary (orange line) learned merely from the labeled dataset (blue circle, red triangle). Based on the labeled data alone, the best annotator we get is the vertical orange line. It perfectly classifies the labeled dataset with zero errors. Nonetheless, it will inevitably annotate some of the unlabeled data (grey points) with wrong labels (middle panel of Figure 2). For example, if we adopted the orange line in the left panel as the annotator, at least 2 grey circle points would be wrongly annotated as triangles, and 2 grey triangle points would be wrongly annotated as circles.
To address the above issue, we can train a classifier based on both the labeled and unlabeled data. Specifically, we first train a classifier on the labeled dataset. Then, we use the pretrained classifier to annotate all unlabeled data, acquiring pseudo labels. We jointly use the combined dataset to retrain the classifier, and then re-annotate the unlabeled data via the retrained classifier, repeating until convergence (i.e., multiple times). Finally, we jointly use the combined dataset to train an adversarially robust deep neural network. The key step is to use the retrained classifier to annotate all unlabeled data multiple times during training.
Take the right panel of Figure 2 as an example, which learns a new decision boundary (i.e., a good annotator). This annotator utilizes the labeled data together with the unlabeled data (grey points), and these unlabeled data elucidate the data distribution well. The annotator, embedded with the knowledge of both labeled and unlabeled data, can characterize the true distribution accurately. Thus, it will annotate ground-truth labels to those unlabeled data (grey points). To sum up, the predetermined annotator (left panel of Figure 2) is not good enough to annotate the unlabeled dataset, which motivates us to explore the retrained annotator (Sections 3.1 and 3.2).
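The retrain-and-re-annotate procedure described above can be sketched in a few lines. Here the deep network is replaced by a 1-D nearest-centroid classifier purely for illustration; all names and numbers below are hypothetical stand-ins, not the paper's setup:

```python
def fit_centroids(data):
    """Return the per-class means of a labeled set [(x, y), ...]."""
    means = {}
    for c in {y for _, y in data}:
        xs = [x for x, y in data if y == c]
        means[c] = sum(xs) / len(xs)
    return means

def annotate(centroids, xs):
    """Pseudo-label each x with the class of the nearest centroid."""
    return [(x, min(centroids, key=lambda c: abs(x - centroids[c]))) for x in xs]

labeled = [(0.0, 0), (0.4, 0), (2.0, 1)]   # scarce labels
unlabeled = [0.1, 0.2, 1.4, 1.5, 1.6]      # unlabeled points

centroids = fit_centroids(labeled)          # the "predetermined annotator"
for _ in range(5):                          # re-annotate multiple times
    pseudo = annotate(centroids, unlabeled)
    centroids = fit_centroids(labeled + pseudo)  # retrain on the union

final = annotate(centroids, unlabeled)
```

The loop alternates annotation and retraining, so the final annotator is shaped by the unlabeled data rather than by the scarce labels alone, which is the point of the retrained annotator.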
3.1 The Simple Realization
The simplest realization is to utilize conventional self-training [18, 15]. The key idea of self-training is to utilize the DNN's predictions on unlabeled data over training iterations, namely annotating the unlabeled data multiple times. Specifically, if the probability assigned to the most likely class of an unlabeled point is higher than a predetermined threshold, the point is added to the training set with that class as its pseudo label for further training. This process is repeated for a fixed number of iterations, or until no more unlabeled data are available or confidently predicted.

Figure 3 empirically justifies the efficacy of self-training (red line), which significantly improves the quality of pseudo labels compared to the predetermined annotator (black line). However, there is a drawback in conventional self-training: the network is hardly able to correct its own mistakes. Assume that the prediction of the deep network on an unlabeled example is incorrect at an early training stage. The example, with its incorrect pseudo label, will nonetheless be utilized in future training iterations. Due to the memorization effect of deep networks [1], the network will fit the wrongly labeled data, which seriously hurts test performance [40]. These negative effects become even worse when the domain of the unlabeled data differs from that of the labeled data [22].
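The confidence-threshold recruitment rule of self-training described above can be sketched as follows, with a hand-made probability function standing in for the trained DNN (the toy model and threshold are assumptions for illustration):

```python
def self_train_round(predict_proba, unlabeled, threshold):
    """One round of self-training: recruit an unlabeled point only if
    the model's top-class probability clears the threshold."""
    recruited, remaining = [], []
    for x in unlabeled:
        probs = predict_proba(x)
        c = max(range(len(probs)), key=probs.__getitem__)  # most likely class
        if probs[c] >= threshold:
            recruited.append((x, c))   # pseudo-labeled, joins training
        else:
            remaining.append(x)        # not confident enough yet
    return recruited, remaining

# toy two-class model on x in [-1, 1]: confident far from the decision point 0
toy = lambda x: [0.5 - x / 2.0, 0.5 + x / 2.0]
recruited, remaining = self_train_round(toy, [0.99, 0.05, -0.98], 0.9)
```

Note the drawback discussed above lives in this rule: a confidently wrong prediction is recruited just as readily as a confidently correct one.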
To ameliorate this inferiority of self-training, a straightforward approach is to introduce a pair of networks that correct each other's mistakes, namely vanilla co-training [4]. Specifically, we train a pair of DNNs (i.e., $f_1$ and $f_2$) simultaneously. We encourage the two networks to make consistent predictions on unlabeled data. Meanwhile, the two DNNs are fed with different orders of labeled data to keep their paces of training inconsistent. To be specific, at each training iteration with labeled mini-batches $L_1$ and $L_2$ and unlabeled mini-batch $U$, the two deep networks $f_1$ and $f_2$ feed forward the common unlabeled data $U$ and the different labeled data $L_1$ and $L_2$, and then update their parameters $w_1$ and $w_2$ by
$w_1 \leftarrow w_1 - \eta\,\nabla_{w_1}\big[\sum_{(x,y)\in L_1}\ell\big(f_1(x),y\big) + \lambda\sum_{x\in U}\mathrm{JS}\big(f_1(x)\,\|\,f_2(x)\big)\big]$, (3)

$w_2 \leftarrow w_2 - \eta\,\nabla_{w_2}\big[\sum_{(x,y)\in L_2}\ell\big(f_2(x),y\big) + \lambda\sum_{x\in U}\mathrm{JS}\big(f_1(x)\,\|\,f_2(x)\big)\big]$, (4)

where $\eta$ is the learning rate, $\lambda$ is the trade-off parameter, $\ell$ is the cross-entropy loss for labeled data, and $\mathrm{JS}$ is the Jensen-Shannon divergence between the predicted probabilities of $f_1$ and $f_2$ on the same unlabeled data $x$.
We leverage the Jensen-Shannon (JS) divergence to measure the similarity between the predicted probabilities of $f_1$ and $f_2$. The JS value is bounded and non-negative, and a smaller value denotes greater similarity between the two probability distributions, and vice versa. By minimizing the JS divergence between the predicted probabilities of $f_1$ and $f_2$ on the unlabeled data $U$, Eqs. (3) and (4) encourage $f_1$ and $f_2$ to make similar predictions on unlabeled data. Meanwhile, at each training iteration, the two DNNs learn from different labeled data $L_1$ and $L_2$, which keeps them diverged from each other. Thus, the two networks could be complementary and could help to correct each other's mistakes on unlabeled data. Besides the Jensen-Shannon divergence, we can also use other divergences, such as the KL divergence and the Hellinger distance.

From Figure 3, we observe an obvious improvement by vanilla co-training (yellow line) compared with self-training (red line). Taking a closer look at the pseudo-label accuracy on unlabeled data, we find that the result of vanilla co-training is better than that of self-training. We believe that the interaction between peer networks (i.e., vanilla co-training) takes positive effect, while there is no interaction within a single network (i.e., self-training). This point is also supported by the philosophy of collaborative learning [9], where each member interacts with the others actively by sharing experiences, and members take on asymmetric roles so that new knowledge can be created among them. Nonetheless, the improvement in label accuracy is not completely satisfying. When the number of unlabeled data increases from 20k to 30k, the label accuracy of vanilla co-training is similar to that of self-training, since vanilla co-training has a collapsing problem: the two networks gradually become the same function, and can then no longer correct each other's mistakes on unlabeled data.
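As a concrete reference for the agreement term in Eqs. (3) and (4), the Jensen-Shannon divergence can be computed directly from its definition as the average KL divergence of the two distributions to their mixture; this plain-Python sketch is illustrative, not the paper's code:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their
    mixture m.  Symmetric, non-negative, and bounded by log 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# two predicted class distributions from a pair of networks
p, q = [0.9, 0.1], [0.2, 0.8]
disagreement = js(p, q)
```

Because the value is zero exactly when the two predictions coincide, driving it down in Eqs. (3) and (4) pulls the two networks toward consensus on unlabeled data.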
As shown in Figure 4 (yellow line), the total variance of the predictions of the two networks is large before 50 epochs. The high total variance denotes that the two networks are diverged in function, since they have different views on unlabeled data at the initial training stage. Their different views come from different initializations and different orders of fetching labeled data. The benefit of such divergence is that one network can gain information from observing its peer network; thus, they have sufficient capacity to correct each other's mistakes on unlabeled data. However, as training proceeds, the total variance of the two networks gradually decreases and approaches zero after 350 epochs. This means the two networks gradually converge to the same function and can no longer correct each other's mistakes. Thus, vanilla co-training gradually degenerates into self-training, which suffers from the accumulated-error problem at later training epochs.
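If the total-variance diagnostic above is read as the total variation distance between the two networks' predicted distributions (our assumption; the paper may use a different statistic), it can be computed as:

```python
def total_variation(p, q):
    """Total variation distance between two predicted distributions:
    0 when the two networks agree exactly (collapsed into one
    function), up to 1 when they fully disagree (diverged)."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# two networks' predictions on the same unlabeled point (toy values)
p1, p2 = [0.9, 0.1], [0.2, 0.8]
tv = total_variation(p1, p2)
```

Tracking this quantity over epochs gives exactly the collapse signal discussed above: a value pinned near zero means the pair has degenerated into a single function.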
3.2 The Powerful Realization
To address the collapsing problem of vanilla co-training, we should keep the two networks diverged in function. Especially at later training epochs, we should add an extra force pulling them away from each other, so that they always retain the capacity to correct each other's mistakes. Inspired by [23], we encourage the two networks to diverge by exploiting each peer's adversarial examples, namely deep co-training.
In general, an adversarial example is modified from a natural example: the network has a large loss on the adversarial example while it has a small loss on the corresponding natural example. Adversarial examples unveil the input space where the network easily makes mistakes. Such space is the weakest part of the network (i.e., where its predictions are unreliable). The two networks can keep inconsistent with each other by learning from each other's weakest parts; namely, each network robustly trains on adversarial examples generated by its peer network. Intuitively, each network always "looks" into its peer's weakest part. Thus, the two networks can prevent themselves from collapsing into one function and constantly keep diverged.
Mathematically, at each training iteration with labeled mini-batches $L_1$ and $L_2$ and unlabeled mini-batch $U$, the two networks update themselves by

$w_1 \leftarrow w_1 - \eta\,\nabla_{w_1}\big[\sum_{(x,y)\in L_1}\ell\big(f_1(x),y\big) + \lambda\sum_{x\in U}\mathrm{JS}\big(f_1(x)\,\|\,f_2(x)\big) + \underbrace{\gamma_1\big(\sum_{x\in L_2}\ell\big(f_1(\tilde{x}_2^{l}),\hat{y}_2\big) + \sum_{x\in U}\ell\big(f_1(\tilde{x}_2^{u}),\hat{y}_2\big)\big)}_{\text{robust training on peer's adversarial examples}}\big]$, (5)

$w_2 \leftarrow w_2 - \eta\,\nabla_{w_2}\big[\sum_{(x,y)\in L_2}\ell\big(f_2(x),y\big) + \lambda\sum_{x\in U}\mathrm{JS}\big(f_1(x)\,\|\,f_2(x)\big) + \underbrace{\gamma_2\big(\sum_{x\in L_1}\ell\big(f_2(\tilde{x}_1^{l}),\hat{y}_1\big) + \sum_{x\in U}\ell\big(f_2(\tilde{x}_1^{u}),\hat{y}_1\big)\big)}_{\text{robust training on peer's adversarial examples}}\big]$, (6)

where $w_1$ and $w_2$ are the weights of $f_1$ and $f_2$, $\eta$ is the learning rate, $\lambda$, $\gamma_1$ and $\gamma_2$ are trade-off parameters, $\ell$ is the cross-entropy loss for labeled data, and $\mathrm{JS}$ is the Jensen-Shannon divergence between the predicted probabilities of $f_1$ and $f_2$. Most importantly, the adversarial data $\tilde{x}_1^{l}$, $\tilde{x}_1^{u}$, $\tilde{x}_2^{l}$ and $\tilde{x}_2^{u}$ are exploited via

$\tilde{x}_1^{l} = \arg\max_{\tilde{x}\in B_\epsilon[x]}\ell\big(f_1(\tilde{x}),\hat{y}_1\big), \; x\in L_1$, (7)

$\tilde{x}_1^{u} = \arg\max_{\tilde{x}\in B_\epsilon[x]}\ell\big(f_1(\tilde{x}),\hat{y}_1\big), \; x\in U$, (8)

$\tilde{x}_2^{l} = \arg\max_{\tilde{x}\in B_\epsilon[x]}\ell\big(f_2(\tilde{x}),\hat{y}_2\big), \; x\in L_2$, (9)

$\tilde{x}_2^{u} = \arg\max_{\tilde{x}\in B_\epsilon[x]}\ell\big(f_2(\tilde{x}),\hat{y}_2\big), \; x\in U$, (10)

where $\hat{y}_1$ and $\hat{y}_2$ are the labels predicted by networks $f_1$ and $f_2$ on $x$, respectively, and $\ell$ is the cross-entropy loss.
Compared with vanilla co-training, the extra underbraced loss terms in Eqs. (5) and (6) are introduced in deep co-training. Note that $\gamma_1$, $\gamma_2$ in Eqs. (5) and (6) and $\epsilon$ in Eqs. (7)-(10) control the "force" that pulls the two networks apart. Specifically, $\gamma_1$ and $\gamma_2$ control the importance of the underbraced divergence terms, while $\epsilon$ decides the allowable size of the norm ball around the natural data within which adversarial examples are generated. Increasing $\gamma_1$, $\gamma_2$ and $\epsilon$ enables more divergence between the two networks. In practice, Eqs. (7)-(10) are hard to solve analytically, so we approximate their solutions by PGD [17] or FGSM [10].
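As a sketch of the cheaper FGSM option, a single signed-gradient step of size $\epsilon$ replaces the iterative search; the toy gradient function below is an assumption for illustration, as before:

```python
import numpy as np

def fgsm(x, grad_fn, epsilon):
    """Single-step FGSM: one signed-gradient step of size epsilon,
    which by construction lands on the boundary of the L-infinity
    ball around x.  `grad_fn` returns the loss gradient w.r.t. x."""
    return x + epsilon * np.sign(grad_fn(x))

# toy quadratic loss ||x' - t||^2 with analytic gradient 2 (x' - t)
t = np.array([1.0, -1.0])
x = np.zeros(2)
x_adv = fgsm(x, lambda z: 2.0 * (z - t), epsilon=0.031)
```

One forward-backward pass per example is the design appeal here: the perturbation is weaker than PGD's, but cheap enough to run inside every co-training update.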
Figure 3 validates the efficacy of deep co-training (blue line). The trade-off parameters $\lambda$, $\gamma_1$ and $\gamma_2$ in Eqs. (5) and (6) are fixed for this experiment, and we utilize single-step FGSM to search for the peer network's adversarial examples. For a fair comparison, vanilla co-training in Eqs. (3) and (4) uses the same $\lambda$. The figure empirically shows that the quality of pseudo labels by deep co-training (blue line) is significantly higher than that of vanilla co-training (yellow line).
To understand deep co-training more deeply, we analyze the total variance of the two networks over training epochs (blue line in Figure 4). Similar to vanilla co-training, the two networks start to converge towards each other. However, at later training epochs (e.g., after 300 epochs), the total variance of vanilla co-training approaches zero, whereas the total variance of deep co-training keeps a positive value, since deep co-training exploits the peers' adversarial examples and prevents the two networks from collapsing into the same function. This brings us to Algorithm 1, called robust co-training, which connects deep co-training and adversarial training. Our proposed algorithm empirically boosts both standard accuracy and robust accuracy, as follows.
Remark.
In Algorithm 1, the quality of pseudo labels on unlabeled data gets improved significantly via deep co-training. Thus, we obtain an augmented dataset by joining the labeled dataset and the high-quality pseudo-labeled dataset. Then, we train an adversarially robust deep network on the augmented dataset using either Madry's adversarial training (i.e., Eq. (1)) [17] or TRADES (i.e., Eq. (2)) [36].
4 Experiments
We conduct experiments on the real-world datasets CIFAR-10 and SVHN [21]. We compare our robust co-training (RCT) with robust self-training (RST) [5], and show that our algorithm gives better pseudo labels than RST. As a result, our algorithm boosts both the standard test accuracy and the robust test accuracy of adversarial training by a large margin. Thus, we empirically justify our main claim: improving the quality of pseudo labels on unlabeled data leads to better adversarial training.
4.1 Quality Improvement of Pseudo Labels
Deep co-training in Algorithm 1 achieves a significant improvement in pseudo-label accuracy. Compared to the predetermined annotator used in RST, we can annotate unlabeled data more accurately. Especially when more unlabeled data are available, deep co-training further increases the quality of pseudo labels, while the predetermined annotator does not have such an effect.
To justify these effects, on CIFAR-10 we randomly select 4k training data as the labeled set and simulate unlabeled datasets of 4k, 8k, 16k, 32k, 40k, and 46k from the remaining data. On SVHN, we randomly select 1k training data as the labeled set and simulate unlabeled datasets of 1k, 2k, 4k, 8k, 17k, 35k, and 72k.
In Figure 5, we compare the pseudo-label accuracy on unlabeled data generated by the predetermined annotator [5] and by deep co-training, respectively. We use CNN13 [13] as the network backbone. Specifically, the predetermined annotator utilizes only a single CNN13, and deep co-training utilizes a pair of CNN13 networks. For the predetermined annotator, we train a single CNN13 based merely on the labeled set until convergence. Then we use the converged CNN13 to yield pseudo labels on all unlabeled data (yellow line).
In deep co-training, we learn an annotator involving unlabeled data. During training, we keep the two networks diverged in function by setting $\gamma_1$, $\gamma_2$ and $\epsilon$ in Eqs. (5) and (6), which is inspired by [23]. We co-train the two CNN13 networks according to Algorithm 1, using SGD with 0.9 momentum and a learning rate that decays over epochs. The adversarial examples of Eqs. (7)-(10) are generated by single-step FGSM [10]. Then, we randomly choose one of the CNN13 pair as the annotator to label the unlabeled data.
Figure 5 shows the gap in pseudo-label accuracy between the predetermined annotator and deep co-training on both the CIFAR-10 and SVHN datasets. Specifically, the predetermined annotator (yellow line) does not involve unlabeled data in the learning process: when more unlabeled data are available, the pseudo-label accuracy does not increase and may even decrease. Therefore, RST [5], which leverages the predetermined annotator, does not incorporate the knowledge of unlabeled data, and performs undesirably (Section 4.2). By comparison, deep co-training (blue line) incorporates unlabeled data to learn the annotator; besides, it introduces the paradigm of collaborative learning so that the networks correct each other's mistakes. As a result, when more unlabeled data are available, the pseudo-label accuracy increases correspondingly. Therefore, RCT (Algorithm 1), which leverages the retrained annotator, incorporates the knowledge of unlabeled data and performs desirably (Section 4.2).
4.2 Improved Performance of Adversarial Training
We annotate the unlabeled data to obtain a pseudo-labeled dataset. Note that different methods can produce the pseudo labels, such as the predetermined annotator (in RST [5]), deep co-training (in Algorithm 1), or expert labeling. Then, we combine the 4000 labeled examples with the pseudo-labeled dataset and conduct adversarial training on the combined set. In Figure 6, we use TRADES [36] (i.e., Eq. (2)) or Madry's adversarial training [17] (i.e., Eq. (1)) to conduct experiments on CNN13 and ResNet-10, where $\beta$ in Eq. (2) is set to 1 for all TRADES experiments. Both Madry's adversarial training and TRADES use PGD-10 to exploit adversarial examples, with step size 0.007 for both CIFAR-10 and SVHN. Inputs are normalized between 0 and 1.
To sum up, we compare three adversarial training methods leveraging unlabeled data in Figure 6.

- Supervised oracle (red line): the unlabeled data are labeled by experts, achieving 100% correct labels on all unlabeled data.

- Robust self-training (yellow line): the unlabeled data are labeled by the predetermined annotator, which provides around 73% correct labels on unlabeled CIFAR-10 data and around 82% correct labels on unlabeled SVHN data (yellow line in Figure 5).

- Robust co-training (blue line): the unlabeled data are labeled by deep co-training. Depending on the amount of unlabeled data, deep co-training gives around 80%-90% correct labels on unlabeled CIFAR-10 data and around 89%-92% correct labels on unlabeled SVHN data (blue line in Figure 5).
To evaluate the performance, we calculate the standard test accuracy using natural test data, and the robust test accuracy using the corresponding adversarial test data. The adversarial test data are generated by PGD-5 and PGD-20, respectively, with step size 0.003. Figure 6 shows that, across different datasets, adversarial training methods, and network structures, the quality improvement of pseudo labels obviously improves adversarial training, namely both standard test accuracy and robust test accuracy get improved significantly.
5 Conclusion
In this paper, we investigate the bottleneck of adversarial learning with unlabeled data, and identify it as the quality of pseudo labels on unlabeled data. To break this bottleneck, we leverage deep co-training to boost the quality of pseudo labels, and thus propose robust co-training (RCT) for adversarial learning with unlabeled data. We conduct extensive experiments on the CIFAR-10 and SVHN datasets. Empirical results demonstrate that RCT significantly outperforms robust self-training (RST) in both standard test accuracy and robust test accuracy across different datasets, network structures, and adversarial training methods. In the future, we will investigate the theory of RCT, and explore more robust adversarial learning methods.
Acknowledgments
MS was supported by JST CREST Grant Number JPMJCR1403.
References
[1] (2017) A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 233–242.
[2] (2014) Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pp. 3365–3373.
[3] (1997) Nonlinear programming. Journal of the Operational Research Society 48(3), pp. 334–334.
[4] (1998) Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100.
[5] (2019) Unlabeled data improves adversarial robustness. arXiv preprint arXiv:1905.13736.
[6] (2010) Semi-supervised learning. 1st edition, The MIT Press.
[7] (2015) DeepDriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730.
[8] (2019) Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320.
[9] (1999) Collaborative learning: cognitive and computational approaches. Advances in Learning and Instruction Series, ERIC.
[10] (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
[11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[12] (2009) Learning multiple layers of features from tiny images. Technical report.
[13] (2017) Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations.
[14] (2019) Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy, pp. 656–672.
[15] (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop on Challenges in Representation Learning.
[16] (2017) A survey on deep learning in medical image analysis. Medical Image Analysis 42, pp. 60–88.
[17] (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
[18] (2006) Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 152–159.
[19] (2019) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(8), pp. 1979–1993.
[20] (2019) Robustness to adversarial perturbations in learning from incomplete data. arXiv preprint arXiv:1905.13021.
[21] (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
[22] (2018) Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pp. 3235–3246.
[23] (2018) Deep co-training for semi-supervised image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–152.
[24] (2018) Certified defenses against adversarial examples. In International Conference on Learning Representations.
[25] (2018) Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pp. 5014–5026.
[26] (2014) Intriguing properties of neural networks. In International Conference on Learning Representations.
[27] (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, pp. 1195–1204.
[28] (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11), pp. 1958–1970.
[29] (2019) Robustness may be at odds with accuracy. In International Conference on Learning Representations.
[30] (2018) Lipschitz-margin training: scalable certification of perturbation invariance for deep neural networks. In Advances in Neural Information Processing Systems 31, pp. 6541–6550.
[31] (2019) Are labels required for improving adversarial robustness? arXiv preprint arXiv:1905.13725.
[32] (1990) Spline models for observational data. Vol. 59, SIAM.
[33] (2019) On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pp. 6586–6595.
[34] (2016) Modular programming with Python. Packt Publishing Ltd.
[35] (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292.
[36] (2019) Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472–7482.
[37] (2019) Towards robust ResNet: a small step but a giant leap. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pp. 4285–4291.
[38] (2015) Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657.
[39] (2005) Learning from labeled and unlabeled data on a directed graph. In Proceedings of the 22nd International Conference on Machine Learning, pp. 1036–1043.
[40] (2004) Class noise vs. attribute noise: a quantitative study. Artificial Intelligence Review 22(3), pp. 177–210.