In statistical learning theory, regularization techniques are typically leveraged to achieve a trade-off between empirical error minimization and control of model complexity. In contrast to classical convex empirical risk minimization, where regularization can rule out trivial solutions, regularization plays a rather different role in deep learning owing to the highly non-convex nature of the optimization. In this paper, we first review two effective and influential branches of regularization for deep neural networks that elegantly generalize from supervised learning to the semi-supervised setting.
Adversarial Training [5, 16] can provide additional regularization beyond that offered by generic strategies such as dropout, pretraining and model averaging. However, recent works [31, 25] demonstrated that this kind of training involves a trade-off between robustness and accuracy, limiting the efficacy of adversarial regularization. In addition, Virtual Adversarial Training (VAT) can be regarded as a natural extension of adversarial training to the semi-supervised setting that requires no label information, instead imposing local smoothness on the classifier. This strategy has achieved great success in image classification, text classification as well as node classification. Tangent-Normal Adversarial Regularization (TNAR) extended VAT by taking the data manifold into consideration, applying VAT along the tangent space and the orthogonal normal space of the data manifold and outperforming other state-of-the-art semi-supervised approaches.
MixUp augmented the training data by incorporating the prior knowledge that linear interpolation of input vectors should lead to linear interpolation of the associated targets, accomplishing consistent improvements in generalization on image, speech and tabular data. MixMatch extended MixUp to semi-supervised tasks by guessing low-entropy labels for data-augmented unlabeled examples and mixing labeled and unlabeled data using MixUp. In contrast with VAT, MixMatch utilizes one specific form of consistency regularization, i.e., standard data augmentation for images, such as random horizontal flips and crops, rather than computing adversarial perturbations to smooth the posterior distribution of the classifier.
Nevertheless, most methods for designing regularization, including the aforementioned approaches, assume that the training samples are drawn independently and identically from an unknown data-generating distribution. For instance, Support Vector Machines (SVM), Back-Propagation (BP) for neural networks, and many other common algorithms implicitly make this assumption as part of their derivation. However, this i.i.d. assumption is commonly violated in realistic scenarios, where batches or sub-groups of training samples are likely to have internal correlations. In particular, Dundar et al. demonstrated that accounting for the correlations in real-world training data leads to statistically significant improvements in accuracy. Similarly, Peer-Regularized Networks (PeerNet) applied graph convolutions [9, 27] to harness information from a graph of peer samples so as to improve the adversarial robustness of deep neural networks. The resulting non-local propagation acts as a strong regularizer that dramatically reduces the vulnerability to adversarial attacks. Inspired by these ideas, we aim to design a general regularization strategy that fully utilizes the internal relationships between samples by explicitly constructing a graph within a batch, in order to further improve generalization in both supervised and semi-supervised settings.
In this paper, we propose Patch-level Neighborhood Interpolation (Pani) for deep neural networks, a simple yet effective regularization that improves the generalization of classifiers. We first construct a graph in each batch during mini-batch stochastic gradient descent training, according to the correlations between patch-level features in different layers of the network rather than between samples directly. The constructed graph is expected to capture the relationships among patch features in both the input and hidden layers. We then apply linear interpolation over the neighbors of the current patch to refine its representation by additionally leveraging neighborhood information. Furthermore, we customize our neighborhood interpolation method into virtual adversarial regularization and MixUp regularization, resulting in Pani VAT and Pani MixUp, respectively.
For Pani VAT, we reformulate the construction of the adversarial perturbation, transforming it from depending solely on the current sample into a linear combination of neighboring patch features. The resulting adversarial perturbation can leverage the information of neighboring features for all samples within a batch, thus providing more informative adversarial smoothness in the semi-supervised setting. Similarly, in Pani MixUp, we extend MixUp from the image level to the patch level by imposing random interpolation between patches in a neighborhood, in order to better leverage a more fine-grained supervised signal. We conduct extensive experiments to demonstrate that both derived regularization strategies outperform other state-of-the-art approaches in supervised and semi-supervised tasks.
Our contributions can be summarized as follows:
- To the best of our knowledge, we are the first to propose a general regularization method that explicitly constructs a patch-level graph to leverage correlations between samples in order to improve generalization.
- The resulting Patch-level Neighborhood Interpolation provides a framework that extends the current main branches of regularization, i.e., adversarial regularization and MixUp, achieving state-of-the-art performance in both supervised and semi-supervised settings.
- Patch-level Neighborhood Interpolation paves the way toward better leveraging neighborhood information in the design of machine learning modules.
2.1 Virtual Adversarial Training
Virtual Adversarial Training (VAT) extends adversarial training by utilizing "virtual" adversarial perturbations to construct adversarial smoothness, obtaining effective improvements in accuracy in semi-supervised learning (SSL). Particularly, VAT replaces the true label $y$ of each sample in the formulation of adversarial training by the current estimate $p(y \mid x, \hat{\theta})$ from the model:

$$\mathcal{L}_{\mathrm{vat}}(x) = D\big[\, p(y \mid x, \hat{\theta}),\; p(y \mid x + r_{\mathrm{vadv}}, \theta) \,\big], \tag{1}$$

where $D[\cdot,\cdot]$ measures the divergence between two distributions and $r_{\mathrm{vadv}}$ is the adversarial perturbation depending on the current sample $x$ that further provides smoothness in SSL. The VAT regularization is then derived from the inner maximization:

$$r_{\mathrm{vadv}} = \operatorname*{arg\,max}_{\|r\|_2 \le \epsilon} D\big[\, p(y \mid x, \hat{\theta}),\; p(y \mid x + r, \theta) \,\big]. \tag{2}$$

One elegant part of VAT is that it utilizes a second-order Taylor expansion of the virtual adversarial loss to compute the perturbation $r_{\mathrm{vadv}}$, which can be obtained efficiently by power iteration with finite differences. Once the desired perturbation $r_{\mathrm{vadv}}$ has been computed, we conduct forward and back propagation to optimize the full loss function:

$$\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \alpha \, \mathcal{L}_{\mathrm{vat}}, \tag{3}$$

where $\mathcal{L}_{\mathrm{sup}}$ is the original supervised loss and $\alpha$ is the hyper-parameter controlling the degree of virtual adversarial smoothness.
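To make the power-iteration step concrete, the following is a minimal NumPy sketch on a toy linear-softmax model (the names `vat_direction`, `xi` and `eps` are ours, and we deliberately use a larger `xi` than VAT's usual 1e-6 so that the naive finite-difference gradient stays numerically stable):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    # KL divergence D[p || q] between two discrete distributions
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def vat_direction(W, x, xi=1e-2, eps=2.0, n_iters=1, h=1e-5, seed=0):
    """Toy power iteration with finite differences for a linear-softmax
    model p(y|x) = softmax(W x), on a single 1-D input x."""
    rng = np.random.default_rng(seed)
    p = softmax(W @ x)                     # current estimate p(y|x, theta_hat)
    d = rng.standard_normal(x.shape)       # random initial direction
    for _ in range(n_iters):
        d = d / (np.linalg.norm(d) + 1e-12)
        base = kl(p, softmax(W @ (x + xi * d)))
        g = np.zeros_like(d)
        for i in range(d.size):            # finite-difference gradient w.r.t. d
            d2 = d.copy()
            d2[i] += h
            g[i] = (kl(p, softmax(W @ (x + xi * d2))) - base) / h
        d = g                              # power-iteration update
    return eps * d / (np.linalg.norm(d) + 1e-12)
```

The returned vector approximates the dominant direction of the divergence, rescaled to the $\epsilon$-ball.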
2.2 MixUp

MixUp augments the training data with linear interpolation on both input features and targets. The resulting feature-target vectors are as follows:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \tag{4}$$
$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j, \tag{5}$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two feature-target vectors drawn randomly from the training data, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ and $\lambda \in [0, 1]$. MixUp can be understood as a form of data augmentation that encourages decision boundaries to transition linearly between classes. It is a generic regularizer that provides a smoother estimate of uncertainty, yielding improved generalization.
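A minimal sketch of this interpolation (hypothetical NumPy code; `alpha` is the Beta-distribution parameter from Eqs. 4-5):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Linearly interpolate two feature-target pairs as in MixUp."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)           # lambda ~ Beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2      # Eq. (4)
    y_mix = lam * y1 + (1 - lam) * y2      # Eq. (5)
    return x_mix, y_mix, lam
```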
2.3 Peer-Regularized Networks (PeerNet)
The centerpiece of PeerNet is the learnable Peer Regularization (PR) layer, designed to improve the adversarial robustness of deep neural networks. The PR layer can be flexibly inserted into the feature maps of deep models.

Let $X^1, \dots, X^N \in \mathbb{R}^{n \times d}$ be the feature maps of $N$ images, where $n$ is the number of pixels and $d$ represents the dimension of each pixel. The core of PeerNet is to find the $K$ nearest neighboring pixels for each pixel among all the pixels of peer images by constructing a nearest neighbor graph in the $d$-dimensional space. Particularly, for the $p$-th pixel $\mathbf{x}_p^i$ in the $i$-th image $X^i$, the $k$-th nearest pixel neighbor can be denoted as $\mathbf{x}_{q_k}^{j_k}$, taken from the $q_k$-th pixel of the peer image $X^{j_k}$. Then the learnable PR layer is constructed by a variant of Graph Attention Networks (GAT):

$$\tilde{\mathbf{x}}_p^i = \sum_{k=1}^{K} \alpha_{p,k}^{i} \, \mathbf{x}_{q_k}^{j_k},$$

where $\alpha_{p,k}^{i}$ is the attention score determining the importance of the $q_k$-th pixel of the $j_k$-th peer image in the representation of the current $p$-th pixel taken from the $i$-th image. Therefore, the resulting learnable PR layer involves non-local filtering that leverages the wisdom of pixel neighbors from peer images, yielding robustness against adversarial attacks.
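As a toy illustration of this attention-weighted aggregation (our own sketch, not the PeerNet implementation; the raw `scores` would come from a learned attention function of each pixel pair):

```python
import numpy as np

def peer_aggregate(neighbors, scores):
    """Softmax-attention aggregation over K peer pixels.
    neighbors: (K, d) neighbor features; scores: (K,) raw attention logits."""
    a = np.exp(scores - scores.max())      # numerically stable softmax
    a = a / a.sum()
    return a @ neighbors                   # attention-weighted combination
```

With equal logits, the layer simply averages the peer pixels; learned logits skew the average toward the most relevant peers.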
3 Patch-level Neighborhood Interpolation
Inspired by the pixel-level nearest neighbor graph in PeerNet, we propose a more general patch-level regularization that can easily extend from a single pixel to the whole image by adjusting the corresponding patch size. For instance, when we set the patch size to 1, we in fact construct a graph based on the features of each pixel, which matches the way the graph is constructed in PeerNet. Another flexible part of our method is that we can choose an arbitrary layer of a deep neural network, including the input layer and hidden layers. In different layers, a suitable patch size can be chosen according to the size of the receptive field in order to capture different levels of semantic information.
Concretely, for our Patch-level Neighborhood Interpolation (Pani) shown in Figure 1, the first step deploys a filtering operation over the whole batch of images to determine the candidate set of peer images for each image; after the filtering, the candidate set $\mathcal{C}_i$ is established for the $i$-th image. The filtering can be performed by retrieving the semantically nearest peer images or by random matching. In the meantime, we construct the whole patch set over the candidate peer images by applying one special convolution that extracts the corresponding patches at different locations of an input or feature map.
Following the establishment of the patch set, we construct a $K$-nearest-neighbor graph based on the cosine distance between patch features in order to find the neighbors of each patch of the $i$-th image with respect to its candidate set $\mathcal{C}_i$. Mathematically, following the definitions in PeerNet, let $\mathbf{x}_p^i$ be the $p$-th patch of the input or feature map for the $i$-th image within one batch. Then denote the $k$-th nearest patch neighbor of $\mathbf{x}_p^i$ as $\mathbf{x}_{q_k}^{j_k}$, taken from the $q_k$-th patch of the peer image $j_k$ in the candidate set $\mathcal{C}_i$.
Next, in order to leverage the knowledge from neighbors, and in contrast to the graph attention mechanism in PeerNet, we apply a more straightforward linear interpolation over the neighboring patches of the current patch $\mathbf{x}_p^i$. The general formulation of our Patch-level Neighborhood Interpolation can then be presented as follows:

$$\bar{\mathbf{x}}_p^i = \lambda_{p,0}^i \, \mathbf{x}_p^i + \sum_{k=1}^{K} \lambda_{p,k}^i \, \mathbf{x}_{q_k}^{j_k}, \tag{6}$$

where $\lambda_{p,k}^i$ is the combination coefficient for the $p$-th patch of the $i$-th image w.r.t. its $k$-th patch neighbor. The choice of linear interpolation is natural and simple yet effective, as shown in our experiments. Additionally, Eq. 6 enjoys a great computational advantage compared with the expensive cost of GAT in PeerNet. Finally, after applying the deconvolution on all the patches with new features, we obtain the refined representation $\bar{\mathbf{x}}^i$ for the $i$-th image.
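The neighbor search and the interpolation of Eq. 6 can be sketched over flattened patch vectors as follows (a hypothetical NumPy sketch with our own names; for simplicity a single weight vector `lam` is shared across patches):

```python
import numpy as np

def cosine_knn(patches, bank, k):
    """Indices of the k nearest patches in `bank` for each row of `patches`,
    under cosine similarity."""
    pn = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-12)
    bn = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-12)
    sims = pn @ bn.T                       # (n_patches, n_bank)
    return np.argsort(-sims, axis=1)[:, :k]

def pani_interpolate(patches, bank, nbr_idx, lam):
    """Eq. (6)-style refinement: weight (1 - sum(lam)) stays on the current
    patch and `lam` (shape (k,)) is spread over its neighbors."""
    lam0 = 1.0 - lam.sum()
    return lam0 * patches + np.einsum("k,nkd->nd", lam, bank[nbr_idx])
```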
Note that our proposed method can explicitly combine the advantages of manifold regularization and non-local filtering in a flexible way, as elaborated in the following.
A flurry of papers have introduced regularization from classical manifold learning, based on the assumption that the data can be modeled as a low-dimensional manifold in the data space. More importantly, Hinton et al. and Ioffe et al. demonstrated that regularizers which work well in the input space can also be applied to the hidden layers of a deep network, further improving generalization performance. Our Patch-level Neighborhood Interpolation can easily extend from the input to the hidden layers, enjoying the benefits of manifold regularization.
Non-local filters have achieved great success in the image processing field by additionally encoding the knowledge of neighboring pixels and their relative locations. Like the pixel-level neighboring correlations established in PeerNet, our patch-level approach still captures the knowledge of neighboring patches within a batch, thereby yielding performance improvements for the derived methods across various settings. Moreover, our Patch-level Neighborhood Interpolation can also serve as a novel non-i.i.d. regularization and can reasonably generalize to broader settings, especially when natural correlations exist within sub-groups.
Now we customize our Patch-level Neighborhood Interpolation into adversarial regularization and MixUp, which significantly boosts their performance.
3.1 Pani VAT
Based on our patch-level framework, we construct a novel Pani VAT that utilizes the combination or interpolation of patch neighbors for each sample to form "neighboring" perturbations, thus providing more informative adversarial smoothness in the semi-supervised setting. Combining Eq. 2 and Eq. 6, we reformulate our Pani VAT with perturbations on $S$ layers of a deep neural network as follows:

$$\max_{\boldsymbol{\delta}} \; D\Big[\, f(x, \hat{\theta}),\; f\big(\bar{h}_0(x; \delta_0), \dots, \bar{h}_{S-1}(x; \delta_{S-1}); \theta\big) \,\Big], \quad \text{s.t.} \;\; \sqrt{\textstyle\sum_{l=0}^{S-1} \gamma_l \, \|\delta_l\|_2^2} \le \epsilon, \tag{7}$$

where $f$ represents the classifier and $h_l(x)$ denotes the input ($l = 0$) or hidden feature of input $x$. $\delta_l$ indicates the perturbation coefficients in the $l$-th considered layer of the network. In particular, when $S = 1$, the perturbations are imposed only on the input features, which is similar to traditional (virtual) adversarial perturbations. $\bar{h}_l(x; \delta_l)$ represents the feature map perturbed in the way shown in Eq. 6, and $\gamma_l$ adjusts the importance of perturbations in different layers, with the overall perturbation restrained in an $\epsilon$-ball.
Next, we can still utilize the power iteration and finite differences proposed in VAT to compute the desired perturbation coefficients $\boldsymbol{\delta}$. The resulting full loss function is then defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \alpha \, D\Big[\, f(x, \hat{\theta}),\; f\big(\bar{h}_0(x; \delta_0^\ast), \dots, \bar{h}_{S-1}(x; \delta_{S-1}^\ast); \theta\big) \,\Big], \tag{8}$$

where $\boldsymbol{\delta}^\ast$ is attained by solving the optimization problem in Eq. 7.
For the specific instantiation of our framework exhibited in Figure 1 for the derived Pani VAT method, we present the procedure as follows:
Firstly, in the filtering process, we construct the nearest neighbor graph over images based on the cosine distance of their penultimate features from the classifier, and construct the patch set through the convolution operation defined in a standard way.
Secondly, for the feature map of each considered layer, we incorporate the $K$ nearest patch neighbors for each patch of each image among all the patches of the peer images, i.e., the candidate set $\mathcal{C}_i$.
Finally, we conduct interpolation in the way shown in Eq. 6 as the non-local forward propagation.
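The filtering step above can be sketched as follows (hypothetical NumPy code; `features` stands for the penultimate-layer features of one batch):

```python
import numpy as np

def filter_peers(features, m):
    """Select the m nearest peer images per image by cosine similarity
    of penultimate-layer features."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-12)
    sims = f @ f.T
    np.fill_diagonal(sims, -np.inf)        # exclude the image itself
    return np.argsort(-sims, axis=1)[:, :m]
```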
Remark. As shown in the adversarial part of Figure 1, the rationale of our Pani VAT lies in the fact that the constructed perturbations entail more non-local information coming from the neighbors of the current sample. Through the delicate patch-level interpolation among the neighbors of each patch, the resulting virtual adversarial perturbations are expected to construct more informative directions of smoothness, thus enhancing the performance of the classifier in the semi-supervised setting.
3.2 Pani MixUp
To derive a fine-grained MixUp, we apply the patch-based neighborhood method from our framework. The core formulation of Pani MixUp (PMU) can be written as:

$$\bar{\mathbf{x}}_p^i = \lambda_{p,0}^i \, \mathbf{x}_p^i + \sum_{k=1}^{K} \lambda_{p,k}^i \, \mathbf{x}_{q_k}^{j_k}, \qquad \text{s.t.} \;\; \lambda_{p,0}^i + \sum_{k=1}^{K} \lambda_{p,k}^i = 1, \;\; \lambda_{p,k}^i \ge 0, \tag{9}$$

for $p = 1, \dots, P$, where $P$ denotes the number of patches after the filtering operation for each image and $\lambda_{p,0}^i$ represents the importance of the current element, such as an image or patch, while conducting MixUp. It should be noted that, due to the asymmetry of $\mathrm{Beta}(\alpha, \beta)$ in our framework, we should tune both $\alpha$ and $\beta$ in our experiments. For simplicity, we fix $\beta$ and only consider $\alpha$ as the hyper-parameter, in order to pay more attention to the importance of the current patch, which is inspired by a similar approach in MixMatch. The first restraint in Eq. 9 can be achieved through normalization according to the ratio between the coefficient of the current element and those of all the neighbors. Considering the physical meaning of $\lambda$ in MixUp, we impose the extra convex-combination restraint, i.e., the second restriction in Eq. 9. The mixed patch-target vectors in the Pani MixUp method can then be presented as:

$$\Big(\{\bar{\mathbf{x}}_p^i\}_{p=1}^{P},\; \tilde{y}^i\Big), \qquad \tilde{y}^i = \frac{1}{P} \sum_{p=1}^{P} \Big( \lambda_{p,0}^i \, y^i + \sum_{k=1}^{K} \lambda_{p,k}^i \, y^{j_k} \Big). \tag{10}$$
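One simple way to realize these constraints (our own sketch; splitting the residual mass uniformly over the neighbors is an assumption, and the `max(lam, 1 - lam)` step mirrors MixMatch's preference for the current element):

```python
import numpy as np

def pani_mixup_weights(alpha, beta, k, rng=None):
    """Draw a coefficient from Beta(alpha, beta), keep weight lam on the
    current patch and split (1 - lam) uniformly over k neighbors, so the
    weights form a convex combination as required by Eq. (9)."""
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, beta)
    lam = max(lam, 1 - lam)                # favor the current patch
    return lam, np.full(k, (1 - lam) / k)
```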
The Pani MixUp applies the following procedure:
First, construct the candidate set $\mathcal{C}_i$ for the $i$-th image by random matching among all the images within one batch, and then construct the patch set through the convolution operation mentioned before.
Second, for the feature map of each target layer, consider the $K$ nearest patch neighbors for each patch of each image among all the patches from the candidate set $\mathcal{C}_i$.
Third, conduct MixUp in the way shown in Eq. 10 among the neighbors of each patch, over all the patches and their targets.
Finally, conduct the deconvolution operation on the patch set to recover the new representation of the original input with the corresponding mixed target, and optimize the parameters of the classifier on the resulting data.
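The convolution/deconvolution pair that moves between images and patch sets can be sketched as a non-overlapping im2col/col2im (a hypothetical NumPy sketch; the actual implementation may use strided convolutions):

```python
import numpy as np

def extract_patches(img, ps):
    """Split a (C, H, W) image into non-overlapping ps x ps patch vectors."""
    C, H, W = img.shape
    patches = [img[:, i:i + ps, j:j + ps].ravel()
               for i in range(0, H - ps + 1, ps)
               for j in range(0, W - ps + 1, ps)]
    return np.stack(patches)               # (P, C*ps*ps)

def fold_patches(patches, ps, shape):
    """Inverse operation: place patch vectors back into a (C, H, W) image."""
    C, H, W = shape
    img = np.zeros(shape)
    idx = 0
    for i in range(0, H - ps + 1, ps):
        for j in range(0, W - ps + 1, ps):
            img[:, i:i + ps, j:j + ps] = patches[idx].reshape(C, ps, ps)
            idx += 1
    return img
```

The two operations are exact inverses for non-overlapping patches, so the refined patches can be folded back into an image of the original shape.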
Remark. Different from the role of the combination coefficients in the aforementioned Pani VAT, where they serve as "combinational" perturbations, in our Pani MixUp approach the coefficients are linear interpolation weights for conducting MixUp. Nevertheless, both customizations derive from one framework, namely our Patch-level Neighborhood Interpolation.
4 Experiments

To demonstrate the superiority of our Patch-level Neighborhood Interpolation, we conduct extensive experiments with both Pani VAT and Pani MixUp, in semi-supervised and supervised settings, respectively.
4.1 Pani VAT
For a fair comparison, especially with VAT and its variants such as VAT + SNTG and TNAR, we choose the standard large convolutional network as the classifier. As for the dataset, we focus on the standard semi-supervised setting on CIFAR-10 with 4,000 labeled data. Unless otherwise noted, all the experimental settings of our method are identical to those of Vanilla VAT. In particular, we apply Pani VAT on the input layer and on one additional hidden layer, yielding the two variants Pani VAT (input) and Pani VAT (+hidden). In Pani VAT (input), we choose the patch size as 2 and a small set of peer images as the candidate set for each image to construct the patch nearest-neighbor graph, and set the perturbation size $\epsilon$ and the adjustment coefficient $\gamma$ to 2.0 and 1.0, respectively. For Pani VAT (+hidden), we use patch size 2 with a fixed overall perturbation size; on the two considered layers, we set $K$ to 10 and 50 and the adjustment coefficients to 1 and 0.5, respectively.
Table 1: Test error (%) on CIFAR-10 with 4,000 labels.

| Method | CIFAR-10, 4,000 labels |
| --- | --- |
| VAT + SNTG | |
| Mean Teacher | |
| Improved GAN | |
| Triple GAN | |
| Bad GAN | |
| Improved GAN + JacobRegu + tangent | |
| Improved GAN + ManiReg | |
| Pani VAT (input) | 12.20 |
| Pani VAT (+hidden) | |
Table 1 presents the state-of-the-art performance achieved by Pani VAT (+hidden) compared with other baselines on CIFAR-10. We focus on baseline methods along the direction of VAT variants and quote results from the TNAR method, the previous state-of-the-art variant of VAT, which additionally leverages the data manifold to decompose the directions of virtual adversarial smoothness. It is worth remarking that the performance of the related GAN-based approaches in Table 1, such as Localized GAN (LGAN) as well as TNAR, mainly relies on modeling the data manifold with a generative model. By contrast, our approach does not depend on this requirement and still outperforms these baselines. In addition, our Pani VAT (+hidden) achieves a slight improvement over Pani VAT (input), verifying the benefit of the manifold regularization discussed in our framework. Although Pani VAT (input), which serves as an ablation, obtains performance only comparable to TNAR, it still outperforms the other baselines without any modeling of the data manifold.
Table 2: Test error (%) of ERM, MixUp and Pani MixUp across architectures, with and without standard data augmentation (rows labelled "no aug.").

| Dataset | Model | ERM | MixUp | Pani MixUp |
| --- | --- | --- | --- | --- |
| CIFAR-10 | PreAct ResNet-18 | 5.43 ± 0.16 | 4.24 ± 0.16 | 3.93 ± 0.12 |
| | PreAct ResNet-18 (no aug.) | 12.81 ± 0.46 | 9.88 ± 0.25 | 8.12 ± 0.09 |
| | PreAct ResNet-34 | 5.15 ± 0.12 | 3.72 ± 0.20 | 3.36 ± 0.15 |
| | PreAct ResNet-34 (no aug.) | 12.67 ± 0.26 | 10.60 ± 0.57 | 8.13 ± 0.32 |
| | WideResNet-28-10 | 4.59 ± 0.06 | 3.21 ± 0.13 | 3.02 ± 0.11 |
| | WideResNet-28-10 (no aug.) | 8.78 ± 0.20 | 8.08 ± 0.39 | 5.79 ± 0.03 |
| CIFAR-100 | PreAct ResNet-18 | 24.96 ± 0.51 | 22.15 ± 0.72 | 20.90 ± 0.21 |
| | PreAct ResNet-18 (no aug.) | 39.64 ± 0.65 | 41.96 ± 0.27 | 32.03 ± 0.34 |
| | PreAct ResNet-34 | 24.85 ± 0.14 | 21.49 ± 0.68 | 19.46 ± 0.29 |
| | PreAct ResNet-34 (no aug.) | 39.41 ± 0.80 | 41.96 ± 0.24 | 34.48 ± 0.86 |
| | WideResNet-28-10 | 21.00 ± 0.09 | 18.58 ± 0.16 | 17.39 ± 0.16 |
| | WideResNet-28-10 (no aug.) | 31.91 ± 0.77 | 35.16 ± 0.33 | 27.71 ± 0.63 |
Analysis of Computation Cost
Another noticeable advantage of our approach is its negligible increase in computation cost compared with Vanilla VAT. In particular, one crucial operation in our approach is the construction of the patch set, which can be accomplished efficiently by the convolution operation; the restoration of images from the constructed patches can similarly be achieved by the corresponding deconvolution. Additionally, the indices of the nearest neighbor graph can be efficiently attained through a top-$k$ operation. The relevant hyper-parameters are the number of peer images in the filtering process, the number of patch neighbors $K$, the number of layers imposed with "neighboring" perturbations $S$, and the patch size.
As shown in Figure 2, varying all parameters except the number of perturbed layers has negligible impact on the per-epoch training time compared with Vanilla VAT. The computational cost increases almost linearly with the number of perturbed layers, since the amount of floating-point computation is proportional to the number of perturbation elements, if we temporarily neglect differences in back-propagation time across layers. Combined with the results from Table 1 and Figure 2, we argue that better performance can be expected if we construct perturbations on more hidden layers, at the cost of more computation.
4.2 Pani MixUp
The experimental settings in this section strictly follow those of Vanilla MixUp and Vanilla MixMatch to ensure a fair comparison. We conduct supervised image classification on the CIFAR-10 and CIFAR-100 datasets to further evaluate the generalization performance of Pani MixUp. In particular, we compare ERM (Empirical Risk Minimization, i.e., normal training), MixUp training and our approach for different neural architectures: PreAct ResNet-18, PreAct ResNet-34 and WideResNet-28-10. For a fair comparison with input MixUp, we apply our approach only on the input layer; better performance can naturally be expected if more layers are considered. More specifically, in Pani MixUp for all neural architectures, we uniformly choose patch size 16 with the Beta-distribution parameter $\alpha$ set to 2.0 in the data-augmentation setting, while we opt for patch size 8, with $\alpha$ tuned accordingly, in the settings without data augmentation.
To extend the flexibility of Pani MixUp, we additionally introduce a mask mechanism on the interpolation coefficients, randomly dropping them at a certain ratio. The mask mechanism can be viewed as dropout or as enforcing sparsity, which helps abandon redundant information while conducting patch-level MixUp. We set the mask ratio to 0.6 in the data-augmentation setting and fix it to 0.4 in the scenario without data augmentation.
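A sketch of this mask mechanism (hypothetical NumPy code; `drop_ratio` corresponds to the mask ratio mentioned above):

```python
import numpy as np

def mask_coefficients(lam, drop_ratio, rng=None):
    """Randomly zero a fraction of interpolation coefficients,
    dropout-style, before conducting patch-level MixUp."""
    rng = rng or np.random.default_rng(0)
    keep = rng.random(lam.shape) >= drop_ratio   # Bernoulli keep mask
    return lam * keep
```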
Table 2 presents the consistent superiority of Pani MixUp over ERM (normal training) as well as Vanilla MixUp across different deep neural network architectures. It is worth noting that the superiority of our approach is more easily observed in the setting without data augmentation than with it. Another interesting phenomenon is that MixUp suffers from a kind of collapse for some deep neural networks without data augmentation, as its performance is even inferior to ERM on CIFAR-100 in that setting. By contrast, our approach exhibits a consistent advantage across settings.
Analysis of Computation Cost
To provide a comprehensive understanding of the computation cost of our method, we plot training time over 200 epochs against test accuracy in Figure 3, where both the computational efficiency and the better performance of our approach can be observed. More specifically, we choose ResNet-18 as the base model and track the variation of test accuracy during training to compare the efficacy of different approaches. We can easily observe the consistent performance advantage of our approach at comparable training time for the same number of epochs. Interestingly, the fourth subplot of Figure 3 reveals how the "collapse" phenomenon unfolds: after learning rate decay completes around the 50th epoch, the performance of MixUp surprisingly drops steadily to a final result that is even inferior to the original ERM. By contrast, our method achieves consistent improvements in generalization without any "collapse" issue.
Further Extension to MixMatch
To further demonstrate the superiority of our approach, we embed it into MixMatch, the current state-of-the-art approach that naturally extends MixUp to the semi-supervised setting. The resulting approach, Pani MixMatch, simply replaces the MixUp part of MixMatch with our Pani MixUp, thus imposing patch-level neighbor MixUp by additionally incorporating patch neighborhood information. Results in Table 3 demonstrate that Pani MixMatch further improves the performance of MixMatch in the standard semi-supervised setting, verifying the effectiveness and flexibility of our Patch-level Neighborhood Interpolation.
5 Related Work

The recent tendency in the design of regularization attaches more importance to consistency and flexibility across various settings. For instance, Virtual Adversarial Training is a natural extension of Adversarial Training to the semi-supervised setting, constructed through virtual adversarial smoothness. MixMatch unified the dominant approaches related to MixUp and achieved remarkable performance in the semi-supervised scenario by simultaneously applying the MixUp operation to both labeled and unlabeled data. Along these lines, we focus on proposing a general regularization motivated by the additional leverage of neighboring information within a sub-group of samples, e.g., within one batch, which elegantly extends previous prestigious regularization approaches and generalizes well over both supervised and semi-supervised settings.
6 Conclusion and Future Work

In this paper, we first analyze the benefit of leveraging non-i.i.d. information while developing more efficient regularization for deep neural networks, and propose a general and flexible patch-level regularizer called Patch-level Neighborhood Interpolation, which interpolates neighborhood representations. Furthermore, we customize our Patch-level Neighborhood Interpolation into VAT and MixUp, respectively. Extensive experiments verify the effectiveness of the two derived approaches, demonstrating the benefit of our Patch-level Neighborhood Interpolation. Our work paves the way toward better understanding and leveraging relationships between samples to design better regularization and improve generalization over a wide range of settings.
Since the proposed Pani framework is general and flexible, more applications could be considered in the future, such as adversarial training for improving model robustness and natural language processing tasks. The theoretical properties of Pani also deserve further analysis.
References

-  David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. Conference on Neural Information Processing Systems, 2019.
-  Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. Good semi-supervised learning that requires a bad gan. In Advances in Neural Information Processing Systems, pages 6510–6520, 2017.
-  Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
-  Murat Dundar, Balaji Krishnapuram, Jinbo Bi, and R Bharat Rao. Learning classifiers when the training data is not iid. In IJCAI, pages 756–761, 2007.
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2014.
-  Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, 2015.
-  Konstantinos Kamnitsas, Daniel C Castro, Loic Le Folgoc, Ian Walker, Ryutaro Tanno, Daniel Rueckert, Ben Glocker, Antonio Criminisi, and Aditya Nori. Semi-supervised learning via compact latent space clustering. arXiv preprint arXiv:1806.02679, 2018.
-  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations, 2016.
-  Abhishek Kumar, Prasanna Sattigeri, and Tom Fletcher. Semi-supervised learning with gans: Manifold invariance with improved inference. In Advances in Neural Information Processing Systems, pages 5540–5550, 2017.
-  Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
-  Bruno Lecouat, Chuan-Sheng Foo, Houssam Zenati, and Vijay R Chandrasekhar. Semi-supervised learning with gans: Revisiting manifold regularization. arXiv preprint arXiv:1805.08957, 2018.
-  Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 2, 2013.
-  Chongxuan Li, Kun Xu, Jun Zhu, and Bo Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291, 2017.
-  Yucen Luo, Jun Zhu, Mengxi Li, Yong Ren, and Bo Zhang. Smooth neighbors on teacher graphs for semi-supervised learning. arXiv preprint arXiv:1711.00258, 2017.
-  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations, 2017.
-  Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. International Conference on Learning Representations, 2016.
-  Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. arXiv preprint arXiv:1704.03976, 2017.
-  Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
-  Guo-Jun Qi, Liheng Zhang, Hao Hu, Marzieh Edraki, Jingdong Wang, and Xian-Sheng Hua. Global versus localized generative adversarial nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
-  Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
-  Ke Sun, Zhouchen Lin, Hantao Guo, and Zhanxing Zhu. Virtual adversarial training on graph convolutional networks in node classification. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 431–443. Springer, 2019.
-  Jan Svoboda, Jonathan Masci, Federico Monti, Michael M Bronstein, and Leonidas Guibas. Peernets: Exploiting peer wisdom against adversarial attacks. International Conference on Learning Representations, 2018.
-  Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pages 1195–1204, 2017.
-  Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. International Conference on Learning Representations, 2018.
-  Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pages 11–30. Springer, 2015.
-  Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. International Conference on Learning Representations, 2017.
-  Bing Yu, Jingfeng Wu, Jinwen Ma, and Zhanxing Zhu. Tangent-normal adversarial regularization for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10676–10684, 2019.
-  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations, 2016.
-  Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. Conference on Neural Information Processing Systems, 2017.
-  Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. International Conference on Machine Learning, 2019.