Patch-level Neighborhood Interpolation: A General and Effective Graph-based Regularization Strategy

11/21/2019 ∙ by Ke Sun, et al. ∙ Peking University

Regularization plays a crucial role in machine learning models, especially for deep neural networks. Existing regularization techniques mainly rely on the i.i.d. assumption and only use the information of the current sample, without leveraging the neighboring information between samples. In this work, we propose a general regularizer called Patch-level Neighborhood Interpolation (Pani) that fully exploits the relationship between samples. By explicitly constructing a patch-level graph in the different network layers and interpolating the neighborhood features to refine the representation of the current sample, Patch-level Neighborhood Interpolation can be applied to enhance two popular regularization strategies, namely Virtual Adversarial Training (VAT) and MixUp, yielding their neighborhood versions. The first derived method, Pani VAT, presents a novel way to construct non-local adversarial smoothness by incorporating patch-level interpolated perturbations. In addition, the Pani MixUp method extends the original MixUp regularization to the patch level and can be further developed into MixMatch, achieving state-of-the-art performance. Finally, extensive experiments are conducted to verify the effectiveness of Patch-level Neighborhood Interpolation in both supervised and semi-supervised settings.




1 Introduction

In statistical learning theory, regularization techniques are typically leveraged to achieve a trade-off between empirical error minimization and control of the model complexity. In contrast to classical convex empirical risk minimization, where regularization can rule out trivial solutions, regularization plays a rather different role in deep learning because of the highly non-convex nature of the optimization. In this paper, we first review two effective and well-established branches of regularization for deep neural networks that generalize elegantly from supervised learning to the semi-supervised setting.

Adversarial Training [5, 16] can provide an additional regularization beyond that provided by other generic regularization strategies, such as dropout, pretraining and model averaging. However, recent works [31, 25] demonstrated that this kind of training method involves a trade-off between robustness and accuracy, limiting the efficacy of adversarial regularization. In addition, Virtual Adversarial Training (VAT) [19] can be regarded as a natural extension of adversarial training to the semi-supervised setting: it imposes local smoothness on the classifier without using label information. This strategy has achieved great success in image classification [19], text classification [17] as well as node classification [22]. Tangent-Normal Adversarial Regularization (TNAR) [28] extended VAT by taking the data manifold into consideration and applied VAT along the tangent space and the orthogonal normal space of the data manifold, outperforming other state-of-the-art semi-supervised approaches.

MixUp [30] augmented the training data by incorporating the prior knowledge that linear interpolation of input vectors should lead to linear interpolation of the associated targets, achieving consistent improvements in generalization on image, speech and tabular data. MixMatch [1] extended MixUp to semi-supervised tasks by guessing low-entropy labels for data-augmented unlabeled examples and mixing labeled and unlabeled data using MixUp. In contrast with VAT, MixMatch [1] uses one specific form of consistency regularization, i.e., standard data augmentation for images, such as random horizontal flips and crops, rather than computing adversarial perturbations to smooth the posterior distribution of the classifier.

Nevertheless, most methods for the design of regularization, including the aforementioned approaches, assume that the training samples are drawn independently and identically from an unknown data-generating distribution. For instance, Support Vector Machines (SVM), Back-Propagation (BP) for neural networks, and many other common algorithms implicitly make this assumption as part of their derivation. However, this i.i.d. assumption is commonly violated in realistic scenarios where batches or sub-groups of training samples are likely to have internal correlations. In particular, Dundar et al. [4] demonstrated that accounting for the correlations in real-world training data leads to statistically significant improvements in accuracy. Similarly, Peer-Regularized Networks (PeerNet) [23] applied graph convolutions [9, 27] to harness information from a graph of peer samples so as to improve the adversarial robustness of deep neural networks. The resulting non-local propagation in the new model acted as a strong regularization that dramatically reduces the vulnerability to adversarial attacks. Inspired by these ideas, we aim to design a general regularization strategy that can fully use the internal relationship between samples by explicitly constructing a graph within a batch, in order to further improve generalization in both supervised and semi-supervised settings.

In this paper, we propose Patch-level Neighborhood Interpolation (Pani) for deep neural networks, serving as a simple yet effective regularization to improve the generalization of classifiers. We first construct a graph in each batch during mini-batch stochastic gradient descent training, according to the correlation between the patch-level features in the different layers of the network rather than among samples directly. The constructed graph is expected to capture the relationship between patch features in both the input and hidden layers. Then we apply linear interpolation on the neighbors of the current patch to refine its representation by additionally leveraging the neighborhood information. Furthermore, we customize our Neighborhood Interpolation method into Virtual Adversarial Regularization and MixUp regularization respectively, resulting in Pani VAT and Pani MixUp.

For Pani VAT, we reformulate the construction of the adversarial perturbation, so that it no longer depends solely on the current sample but is a linear combination of neighboring patch features. The resulting adversarial perturbation can leverage the information of neighboring features for all samples within a batch, thus providing more informative adversarial smoothness in the semi-supervised setting. Similarly, in Pani MixUp, we extend MixUp from the image level to the patch level by imposing random interpolation between patches in a neighborhood, which makes better use of fine-grained supervised signals. We conduct extensive experiments to demonstrate that both derived regularization strategies can outperform other state-of-the-art approaches in both supervised and semi-supervised tasks.

Our contributions can be summarized as follows:

  • To the best of our knowledge, we are the first to propose a general regularization method by explicitly constructing a patch-level graph that focuses on leveraging the information of correlations between samples in order to improve the generalization.

  • The resulting Patch-level Neighborhood Interpolation provides a framework that can extend the current main branches of regularization, i.e., adversarial regularization and MixUp, achieving state-of-the-art performance in both supervised and semi-supervised settings.

  • Patch-level Neighborhood Interpolation paves a way toward better leveraging the neighborhood information on the design of machine learning modules.

2 Preliminary

2.1 Virtual Adversarial Training

VAT [19] extends adversarial training by using "virtual" adversarial perturbations to construct adversarial smoothness, which effectively improves accuracy in semi-supervised learning (SSL). In particular, VAT replaces the true labels $y$ of samples in the formulation of adversarial training with the current estimate $f(x; \hat{\theta})$ from the model:

$$\min_{\theta} \max_{\|r\|_2 \le \epsilon} D\big[f(x; \hat{\theta}),\, f(x + r; \theta)\big], \tag{1}$$

where $D[\cdot, \cdot]$ measures the divergence between the two distributions $f(x; \hat{\theta})$ and $f(x + r; \theta)$. $r$ is the adversarial perturbation depending on the current sample that can further provide smoothness in SSL. The VAT regularization can then be derived from the inner maximization:

$$R_{\mathrm{vadv}}(x; \theta) = \max_{\|r\|_2 \le \epsilon} D\big[f(x; \hat{\theta}),\, f(x + r; \theta)\big]. \tag{2}$$

One elegant part of VAT is that it uses the second-order Taylor expansion of the virtual adversarial loss to compute the perturbation $r$, which can be obtained efficiently by power iteration with finite differences. Once the desired perturbation $r^{*}$ has been computed, we can conduct forward and back propagation to optimize the full loss function:

$$\mathcal{L}(\theta) = \mathcal{L}_0(\theta) + \beta\, \mathbb{E}_{x}\, D\big[f(x; \hat{\theta}),\, f(x + r^{*}; \theta)\big], \tag{3}$$

where $\mathcal{L}_0$ is the original supervised loss and $\beta$ is the hyper-parameter that controls the degree of virtual adversarial smoothness.

Figure 1: Pipeline of our Patch-level Neighborhood Interpolation followed by the two derived regularizers, i.e., Pani VAT and Pani MixUp. The figure marks the perturbation constructed by our method and the mixing coefficient pair.
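The power iteration with finite differences mentioned above can be illustrated on a toy model. The following is a minimal sketch, not the authors' implementation: it assumes a linear softmax classifier $f(x) = \mathrm{softmax}(Wx)$ and the KL divergence as $D$, so that the gradient of the divergence with respect to the perturbation has the closed form $W^\top(q - p)$. The function names and the parameters `xi` and `n_iter` are illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def vat_perturbation(W, x, epsilon=1.0, xi=1e-2, n_iter=1, seed=0):
    """Approximate the virtual adversarial perturbation for softmax(W @ x).

    Assumption: linear softmax model, so d KL(p || q(r)) / dr = W.T @ (q - p).
    """
    rng = np.random.default_rng(seed)
    p = softmax(W @ x)                 # current estimate, treated as constant
    d = rng.normal(size=x.shape)
    d /= np.linalg.norm(d)             # random unit direction
    for _ in range(n_iter):
        # Finite difference: the gradient at r = 0 vanishes, so the gradient
        # at r = xi * d is proportional to the Hessian-vector product H @ d.
        q = softmax(W @ (x + xi * d))
        grad = W.T @ (q - p)
        norm = np.linalg.norm(grad)
        if norm == 0.0:
            break
        d = grad / norm                # one power-iteration step
    return epsilon * d                 # scale onto the epsilon-sphere
```

The returned vector has norm `epsilon` and can be plugged into the divergence term of the full loss; for a deep network the closed-form gradient would be replaced by automatic differentiation.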

2.2 MixUp

MixUp [30] augmented the training data with linear interpolation on both input features and targets. The resulting feature-target vectors are as follows:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \tag{4}$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two feature-target vectors drawn randomly from the training data, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ and $\alpha \in (0, \infty)$. MixUp can be understood as a form of data augmentation that encourages decision boundaries to transition linearly between classes. It is a generic regularization that provides a smoother estimate of uncertainty, which improves generalization.
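As a concrete illustration of the interpolation described above, here is a minimal NumPy sketch of batch-level MixUp. It assumes one-hot targets and pairs each sample with a randomly permuted partner from the same batch; the function name `mixup_batch` and its signature are illustrative, not taken from the original MixUp code.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=1.0, rng=None):
    """Mix each sample with a random partner from the same batch."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # random partner for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

Because the same coefficient is applied to inputs and targets, every mixed target remains a valid probability vector.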

2.3 Peer-Regularized Networks (PeerNet)

The centerpiece of PeerNet [23] is the learnable Peer Regularization (PR) layer designed to focus on improving the adversarial robustness of deep neural networks. PR layer can be flexibly added into the feature maps of deep models.

Let $X^1, \dots, X^N$ be $n \times d$ matrices representing the feature maps of $N$ images, where $n$ is the number of pixels and $d$ is the dimension of each pixel. The core of PeerNet is to find the $K$ nearest neighboring pixels for each pixel among all the pixels of the peer images by constructing a nearest neighbor graph in the $d$-dimensional space. In particular, for the $p$-th pixel in the $i$-th image, $x^i_p$, the $k$-th nearest pixel neighbor can be denoted as $x^{j_k}_{q_k}$, taken from pixel $q_k$ of the peer image $j_k$. Then the learnable PR layer is constructed by a variant of Graph Attention Networks (GAT) [27]:

$$\tilde{x}^i_p = \sum_{k=1}^{K} \alpha_{i j_k p q_k}\, x^{j_k}_{q_k}, \tag{5}$$

where $\alpha_{i j_k p q_k}$ is the attention score determining the importance of the $q_k$-th pixel of the $j_k$-th peer image for the representation of the current $p$-th pixel taken from the $i$-th image. Therefore, the resulting learnable PR layer involves non-local filtering by leveraging the wisdom of pixel neighbors from peer images, yielding robustness against adversarial attacks.

3 Patch-level Neighborhood Interpolation

Inspired by the pixel-level nearest neighbor graph in PeerNet, we propose a more general patch-level regularization that can easily extend from a single pixel to the whole image by adjusting the patch size. For instance, when we set the patch size to 1, we in fact construct a graph based on the features of each pixel, which is the same way the graph is constructed in PeerNet. Another flexible part of our method is that we can choose an arbitrary layer in a deep neural network, including the input layer and the hidden layers. In different layers, a different patch size can be chosen according to the size of the receptive field in order to capture different semantic information.

Concretely, for our Patch-level Neighborhood Interpolation (Pani) shown in Figure 1, in the first step we apply a filtering operation to all images in a batch to determine the candidate set of peer images for each image. For example, after the filtering, a candidate set is established for the $i$-th image. The filtering can be done by retrieving the semantically nearest peer images or by random matching. Meanwhile, we construct the set of all patches in the candidate peer images by applying a special convolution that extracts the patches at the different locations of an input or feature map.

Following the establishment of the patch set, we construct a nearest neighbor graph based on the cosine distance of patch features in order to find the neighbors of each patch of the $i$-th image with respect to its candidate set. Mathematically, following the definition in PeerNet, let $x^i_p$ be the $p$-th patch on the input or feature map of the $i$-th image within one batch. Then denote the $k$-th nearest patch neighbor of $x^i_p$ as $x^{j_k}_{q_k}$, taken from patch $q_k$ of the peer image $j_k$ in the candidate set.
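The patch extraction and the cosine nearest neighbor search can be sketched in a few lines of NumPy. This is a hedged illustration under stated assumptions: patches are taken to be non-overlapping squares of side `s` (the text only says they are extracted by a convolution), and the helper names `extract_patches` and `knn_patches` are hypothetical.

```python
import numpy as np

def extract_patches(x, s):
    """Split a batch (N, C, H, W) into non-overlapping s x s patches.

    Returns an array of shape (N, P, C*s*s) with P = (H/s) * (W/s).
    """
    n, c, h, w = x.shape
    assert h % s == 0 and w % s == 0, "patch size must divide H and W"
    x = x.reshape(n, c, h // s, s, w // s, s)
    x = x.transpose(0, 2, 4, 1, 3, 5)          # (N, H/s, W/s, C, s, s)
    return x.reshape(n, (h // s) * (w // s), c * s * s)

def knn_patches(patches, i, peers, k):
    """For every patch of image i, find its k most cosine-similar patches
    among the patches of the candidate peer images.

    Returns (image_idx, patch_idx), both of shape (P, k).
    """
    _, p, d = patches.shape
    query = patches[i]                          # (P, D)
    cand = patches[peers].reshape(-1, d)        # (len(peers) * P, D)
    qn = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-12)
    cn = cand / (np.linalg.norm(cand, axis=1, keepdims=True) + 1e-12)
    sim = qn @ cn.T                             # cosine similarity matrix
    top = np.argsort(-sim, axis=1)[:, :k]       # indices of the k largest
    image_idx = np.asarray(peers)[top // p]     # which peer image
    patch_idx = top % p                         # which patch inside it
    return image_idx, patch_idx
```

The full sort is used here for clarity; a top-k routine (for example `np.argpartition`, or the `topk` operation of a deep learning framework) would avoid sorting the whole similarity row.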

Next, in order to leverage the knowledge from neighbors, and different from the graph attention mechanism in PeerNet, we apply a more straightforward linear interpolation on the neighboring patches of the current patch $x^i_p$. The general formulation of our Patch-level Neighborhood Interpolation can then be presented as follows:

$$\tilde{x}^i_p = x^i_p + \sum_{k=1}^{K} \eta_{ipk}\, \big(x^{j_k}_{q_k} - x^i_p\big), \tag{6}$$

where $\eta_{ipk}$ is the combination coefficient for the $p$-th patch of the $i$-th image w.r.t. its $k$-th patch neighbor. The choice of linear interpolation or combination is natural, simple and effective, as shown in our experiments. Additionally, Eq. 6 enjoys a great computational advantage compared with the expensive cost of GAT in PeerNet. Finally, after the deconvolution on all the patches with new features, we obtain the refined representation $\tilde{x}^i$ of the $i$-th image.
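The interpolation step and the restoration of the image from refined patches can be sketched as follows. This is an illustration under assumptions, not the authors' code: the interpolation is assumed to add a weighted sum of differences between neighbors and the current patch, and the fold is assumed to invert a split into non-overlapping patches laid out as `(N, P, C*s*s)`. The names `interpolate_patches` and `fold_patches` are hypothetical.

```python
import numpy as np

def interpolate_patches(current, neighbors, eta):
    """Refine each patch with a linear combination of its neighbors.

    current:   (P, D)     patches of one image
    neighbors: (P, K, D)  the K neighbor patches of every patch
    eta:       (P, K)     combination coefficients
    Assumed form: x_tilde = x + sum_k eta_k * (neighbor_k - x).
    """
    diff = neighbors - current[:, None, :]
    return current + np.einsum("pk,pkd->pd", eta, diff)

def fold_patches(patches, c, h, w, s):
    """Reassemble non-overlapping s x s patches (N, P, C*s*s) into (N, C, H, W)."""
    n = patches.shape[0]
    x = patches.reshape(n, h // s, w // s, c, s, s)
    x = x.transpose(0, 3, 1, 4, 2, 5)           # (N, C, H/s, s, W/s, s)
    return x.reshape(n, c, h, w)
```

With all coefficients set to zero the patch is left unchanged, and with a single coefficient equal to one the patch is replaced by that neighbor, which makes the role of the coefficients easy to check.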

Note that our proposed method explicitly combines the advantages of manifold regularization and non-local filtering in a flexible way, as elaborated in the following.

Manifold Regularization

There is a flurry of papers introducing regularization from classical manifold learning, based on the assumption that the data can be modeled as a low-dimensional manifold in the data space. More importantly, Hinton et al. [6] and Ioffe et al. [7] demonstrated that regularizers that work well in the input space can also be applied to the hidden layers of a deep network, which can further improve generalization performance. Our Patch-level Neighborhood Interpolation can be easily extended from the input to the hidden layers, enjoying the benefits of manifold regularization.

Non-local Filtering

Non-local filters have achieved great success in the image processing field by additionally encoding the knowledge of neighboring pixels and their relative locations. Like the pixel-level neighboring correlations established in PeerNet [23], our patch-level approach can capture the knowledge of other neighboring patches within a batch, therefore improving the performance of the derived methods in various settings. Moreover, our Patch-level Neighborhood Interpolation can also serve as a novel non-i.i.d. regularization and can reasonably be expected to generalize to broader settings, especially when natural correlation exists within a sub-group.

We now customize our Patch-level Neighborhood Interpolation into adversarial regularization and MixUp, which can significantly boost their performance.

3.1 Pani VAT

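Based on our patch-level framework, we can construct a novel Pani VAT that uses the combination or interpolation of patch neighbors for each sample to manipulate the "neighboring" perturbations, thus providing more informative adversarial smoothness in the semi-supervised setting. Combining Eq. 2 and Eq. 6, we reformulate our Pani VAT with perturbations on $L$ layers in a deep neural network as follows:

$$\max_{\eta}\; D\big[f(x; \hat{\theta}),\, f(\tilde{x}(\eta); \theta)\big] \quad \text{s.t.} \quad \sum_{l=1}^{L} w_l^2\, \|\eta^{(l)}\|_2^2 \le \epsilon^2, \tag{7}$$

where $f$ represents the classifier and $x^{(l)}$ denotes the input or hidden feature of the input $x$. $\eta^{(l)}$ indicates the perturbations in the $l$-th layer of the network. In particular, when $L = 1$, the perturbations are imposed only on the input feature, which is similar to the traditional (virtual) adversarial perturbations. $\tilde{x}^{(l)}(\eta^{(l)})$ represents the feature map perturbed by $\eta^{(l)}$ in the way shown in Eq. 6, and $\tilde{x}(\eta)$ collects these perturbed features over the $L$ layers. $w_l$ adjusts the importance of the perturbations in different layers, with the overall perturbations restrained in an $\epsilon$-ball.

Next, we can still use the power iteration and finite difference proposed in VAT [18] to compute the desired perturbation $\eta^{*}$. The resulting full loss function is then defined as:

$$\mathcal{L}(\theta) = \mathcal{L}_0(\theta) + \beta\, \mathbb{E}_{x}\, D\big[f(x; \hat{\theta}),\, f(\tilde{x}(\eta^{*}); \theta)\big], \tag{8}$$

where $\mathcal{L}_0$ is the supervised loss, $\beta$ weights the regularizer, and $\eta^{*}$ can be attained after solving the optimization problem in Eq. 7.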


For the specific instantiation of our framework shown in Figure 1, the procedure of the derived Pani VAT method is as follows:

  • First, in the filtering process, we construct the nearest neighbor graph on the images based on the cosine distance of the second-to-last-layer features of the classifier. We then construct the patch set through the convolution operation defined in a standard way.

  • Second, for the feature map on each considered layer, we find the nearest patch neighbors for each patch of each image among all the patches from the peer patches, i.e., the candidate set.

  • Third, we conduct interpolation in the way shown in Eq. 6 as the non-local forward propagation.

  • Finally, we perform VAT to compute the desired perturbation under the constraints in Eq. 7, and obtain the full objective function in Eq. 8.

Remark. As shown in the adversarial part of Figure 1, the rationale of our Pani VAT method is that the constructed perturbations can carry more non-local information coming from the neighbors of the current sample. Through the patch-level interpolation among the neighbors of each patch, the resulting virtual adversarial perturbations are expected to construct more informative directions of smoothness, thus enhancing the performance of the classifier in the semi-supervised setting.

3.2 Pani MixUp

To derive a fine-grained MixUp, we apply the patch-based neighborhood method from our framework. The core of Pani MixUp (PMU) can be formulated as:


where denotes the number of patches after the filtering operation for each image and represents the importance of the current element, such as an image or a patch, while conducting MixUp. It should be noted that due to the asymmetric property of in our framework, we should tune both the and in our experiments. For simplicity, we fix and only consider the as the hyper-parameter to pay more attention to the importance of the current patch, which is inspired by the similar approach in MixMatch [1]. The first constraint in Eq. 9 can be satisfied through normalization according to the ratio of for the current element and for all the neighbors. Considering the physical meaning of in MixUp, we impose an extra convex combination constraint, i.e., the second constraint in Eq. 9. Then the mixing patch-target vectors in the Pani MixUp method can be presented as:



from a uniform distribution or a beta distribution, subject to



The Pani MixUp applies the following procedure:

  • Construct the candidate set for the $i$-th image by random matching among all the images within one batch, and then construct the patch set through the convolution operation mentioned before.

  • For the feature map on each target layer, consider the nearest patch neighbors for each patch of each image among all the patches from the candidate set.

  • Conduct MixUp in the way shown in Eq. 10 among the neighbors of each patch over all the patches and their targets.

  • Conduct the deconvolution operation on the patch set to return the new representation of the original input with the corresponding mixed target. Optimize the parameters of the classifier on the resulting data representation.
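The mixing step of the procedure above can be sketched as follows. The exact constraints of Eq. 9 and Eq. 10 are not reproduced here, so this sketch makes explicit assumptions: the current patch keeps a weight `lam` drawn from a Beta distribution, the neighbor coefficients of every patch are drawn uniformly and normalized to sum to `1 - lam` (a convex combination), and the image-level target averages the patch-level label mixtures. All names and the default values are illustrative.

```python
import numpy as np

def sample_mix_coefficients(num_patches, k, alpha=2.0, beta=1.0, rng=None):
    """Sample the weight of the current patch and the neighbor coefficients.

    Assumption: lam ~ Beta(alpha, beta); each row of eta sums to 1 - lam.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, beta)
    raw = rng.uniform(size=(num_patches, k))
    eta = (1.0 - lam) * raw / raw.sum(axis=1, keepdims=True)
    return lam, eta

def pani_mixup_patches(current, neighbors, y_current, y_neighbors, lam, eta):
    """Mix patches and targets with the same coefficients.

    current:     (P, D)     patches of one image
    neighbors:   (P, K, D)  neighbor patches
    y_current:   (C,)       one-hot target of the image
    y_neighbors: (P, K, C)  one-hot targets of the neighbors' source images
    """
    x_mix = lam * current + np.einsum("pk,pkd->pd", eta, neighbors)
    # Assumed image-level target: average of the per-patch label mixtures.
    y_mix = lam * y_current + np.einsum("pk,pkc->c", eta, y_neighbors) / len(current)
    return x_mix, y_mix
```

Under these assumptions the mixed target remains a probability vector, because the neighbor weights of every patch sum to `1 - lam`.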

Remark. Different from the role of the combination coefficients in the aforementioned Pani VAT, where they serve as the "combinational" perturbations, in our Pani MixUp approach they are the linear interpolation coefficients used to conduct MixUp. However, both customizations can be derived from one framework, namely our Patch-level Neighborhood Interpolation.

4 Experiments

To demonstrate the superiority of our Patch-level Neighborhood Interpolation, we conduct extensive experiments for both our Pani VAT and Pani MixUp Method on semi-supervised and supervised settings, respectively.

4.1 Pani VAT

Implementation Details

For a fair comparison, especially with VAT and its variants such as VAT + SNTG [15] and TNAR [28], we choose the standard large convolutional network as the classifier, as in [19]. For the dataset, we focus on the standard semi-supervised setting on CIFAR-10 with 4,000 labeled data. Unless otherwise noted, all the experimental settings in our method are identical to those in Vanilla VAT [19]. In particular, we apply our Pani VAT on the input layer and on one additional hidden layer, yielding two variants, Pani VAT (input) and Pani VAT (+hidden). In Pani VAT (input), we choose patch size as 2, peer images as the candidate set for each image, to construct the patch nearest neighbor graph, perturbation size and adjustment coefficient as 2.0 and 1.0, respectively. For our Pani VAT (+hidden) method, we let , patch size as 2 and overall perturbation size . On the considered two layers, we set as 10 and 50, the adjustment coefficient as 1 and 0.5, respectively.

Method CIFAR-10 4,000 labels
VAT [18]
VAT + SNTG [15]
Π model [11]
Mean Teacher [24]
CCLP [8]
ALI [3]
Improved GAN [21]
Triple GAN [14]
Bad GAN [2]
LGAN [20]
Improved GAN + JacobRegu + tangent [10]
Improved GAN + ManiReg [12]
TNAR [28]
Pani VAT (input) 12.20
Pani VAT (+hidden)
Table 1: Classification errors (%) of compared methods on CIFAR-10 without data augmentation.

Our Results

Table 1 presents the state-of-the-art performance achieved by Pani VAT (+hidden) compared with other baselines on CIFAR-10. We focus on baseline methods along the direction of VAT variants and refer to the results from the TNAR method [28], the previous state-of-the-art variant of VAT, which additionally leverages the data manifold to decompose the directions of virtual adversarial smoothness. It is worth remarking that the performance of the relevant GAN-based approaches in Table 1, such as Localized GAN (LGAN) [20] as well as TNAR, mainly relies on modeling the data manifold with a generative model. By contrast, our approach does not depend on this requirement and can still outperform these baselines. In addition, our Pani VAT (+hidden) achieves a slight improvement over Pani VAT (input), supporting the benefit of manifold regularization mentioned in our framework. Although Pani VAT (input), serving as an ablation study, obtains performance comparable with TNAR, it still outperforms the other baselines without additionally modeling the data manifold.

Figure 2: Average training time per epoch with respect to the number of peer images, the number of patch neighbors, the number of perturbed layers, and the patch size.

Dataset   | Model            | Aug | ERM          | MixUp        | Ours (input)
CIFAR-10  | PreAct ResNet-18 | yes | 5.43 ± 0.16  | 4.24 ± 0.16  | 3.93 ± 0.12
CIFAR-10  | PreAct ResNet-18 | no  | 12.81 ± 0.46 | 9.88 ± 0.25  | 8.12 ± 0.09
CIFAR-10  | PreAct ResNet-34 | yes | 5.15 ± 0.12  | 3.72 ± 0.20  | 3.36 ± 0.15
CIFAR-10  | PreAct ResNet-34 | no  | 12.67 ± 0.26 | 10.60 ± 0.57 | 8.13 ± 0.32
CIFAR-10  | WideResNet-28-10 | yes | 4.59 ± 0.06  | 3.21 ± 0.13  | 3.02 ± 0.11
CIFAR-10  | WideResNet-28-10 | no  | 8.78 ± 0.20  | 8.08 ± 0.39  | 5.79 ± 0.03
CIFAR-100 | PreAct ResNet-18 | yes | 24.96 ± 0.51 | 22.15 ± 0.72 | 20.90 ± 0.21
CIFAR-100 | PreAct ResNet-18 | no  | 39.64 ± 0.65 | 41.96 ± 0.27 | 32.03 ± 0.34
CIFAR-100 | PreAct ResNet-34 | yes | 24.85 ± 0.14 | 21.49 ± 0.68 | 19.46 ± 0.29
CIFAR-100 | PreAct ResNet-34 | no  | 39.41 ± 0.80 | 41.96 ± 0.24 | 34.48 ± 0.86
CIFAR-100 | WideResNet-28-10 | yes | 21.00 ± 0.09 | 18.58 ± 0.16 | 17.39 ± 0.16
CIFAR-100 | WideResNet-28-10 | no  | 31.91 ± 0.77 | 35.16 ± 0.33 | 27.71 ± 0.63
Table 2: Test error (%) of ERM, MixUp and Pani MixUp (input) across three deep neural network architectures, with and without data augmentation. All results are averaged over 5 runs. The MixUp results in the settings without data augmentation are based on our runs of the original MixUp code.

Analysis of Computation Cost

Another noticeable advantage of our approach is the negligible increase in computation cost compared with Vanilla VAT. In particular, one crucial operation in our approach is the construction of the patch set, which can be accomplished efficiently by the convolution operation. The restoration of images from the constructed patches can similarly be achieved by the corresponding deconvolution. Additionally, the indices of the nearest neighbor graph can be obtained efficiently through the topk operation in TensorFlow or PyTorch. We conduct a further sensitivity analysis of the computational cost of our method with respect to the other parameters, i.e., the number of peer images in the filtering process, the number of patch neighbors, the number of layers with "neighboring" perturbations, and the patch size.

As shown in Figure 2, varying any of these parameters has a negligible impact on the training time per epoch compared with Vanilla VAT, except for the number of perturbed layers. The computational cost grows almost linearly with the number of perturbed layers, since the amount of floating-point computation is proportional to the number of perturbation elements, if we temporarily neglect the difference in backpropagation time across layers. Combining the results from Table 1 and Figure 2, we argue that better performance can be expected if we construct perturbations on more hidden layers, at the cost of more computation.

4.2 Pani MixUp

Implementation Details

The experimental settings in this section strictly follow those in Vanilla MixUp [30] and Vanilla MixMatch [1] to ensure a fair comparison. We conduct supervised image classification on the CIFAR-10 and CIFAR-100 datasets to further evaluate the generalization performance of Pani MixUp. In particular, we compare ERM (Empirical Risk Minimization, i.e., normal training), MixUp training and our approach for different neural architectures: PreAct ResNet-18, PreAct ResNet-34 and WideResNet-28-10. For a fair comparison with input MixUp, we apply our approach only on the input layer; better performance can naturally be expected if more layers are considered. More specifically, in our Pani MixUp method for all neural architectures, we uniformly choose patch size 16 and the Beta distribution parameter 2.0 for the data augmentation setting, while we opt for patch size 8, on the settings without data augmentation.

Mask Mechanism

To extend the flexibility of Pani MixUp, we additionally introduce a mask mechanism on the interpolation coefficients that randomly drops them with a certain ratio. The mask mechanism can be viewed as dropout or as enforcing sparsity, which helps to discard redundant information while conducting patch-level MixUp. We set the mask ratio to 0.6 in the data augmentation setting and to 0.4 in the scenario without data augmentation.
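A minimal sketch of such a mask, assuming each coefficient is dropped independently with probability equal to the mask ratio. Whether the remaining coefficients are renormalized afterwards is not stated in the text, so this sketch leaves them unchanged; the function name `mask_coefficients` is hypothetical.

```python
import numpy as np

def mask_coefficients(eta, mask_ratio, rng=None):
    """Randomly zero out interpolation coefficients.

    Each entry of eta is dropped independently with probability mask_ratio.
    Assumption: surviving entries are kept as they are (no renormalization).
    """
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.uniform(size=eta.shape) >= mask_ratio
    return eta * keep
```

With the ratios quoted above, roughly 60% (or 40%) of the neighbor coefficients would be set to zero in every step.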

Figure 3: Test accuracy with respect to the training time over ERM, MixUp and our approach. indicates minutes and the leap in the middle of training comes from the learning rate decay. “with Aug” and “without Aug” denote the settings with data augmentation and without augmentation, respectively.

Our Results

Table 2 shows the consistent superiority of Pani MixUp over ERM (normal training) as well as Vanilla MixUp across different deep neural network architectures. It is worth noting that the advantage of our approach is more pronounced in the setting without data augmentation than in the setting with data augmentation. Another interesting phenomenon is that MixUp suffers from a kind of collapse for some deep neural networks without data augmentation: its performance is even inferior to ERM on CIFAR-100 without data augmentation. By contrast, our approach exhibits a consistent advantage over the various settings.

Analysis of Computation Cost

To provide a comprehensive understanding of the computation cost of our method, we plot test accuracy against training time over 200 epochs in Figure 3, in which both the computational efficiency and the better performance of our approach can be observed. More specifically, we choose ResNet-18 as the basic test model and track the test accuracy during training to compare the efficacy of the different approaches. Our approach shows a consistent performance advantage and a comparable training time under the same number of epochs. The fourth subplot of Figure 3 reveals how the "collapse" phenomenon unfolds. After the learning rate decay completes around the 50th epoch, the performance of MixUp surprisingly drops steadily to a final result that is even inferior to the original ERM. By contrast, our neighborhood method achieves consistent improvement in generalization without any "collapse" issue.

Further Extension to the MixMatch

To further demonstrate the superiority of our Neighborhood Interpolation MixUp, we embed our approach into MixMatch [1], the current state-of-the-art approach that naturally extends MixUp to semi-supervised setting. The resulting approach, Pani MixMatch, elegantly replaces the MixUp part in the MixMatch with our Pani MixUp, thus imposing patch neighbor Mixup by additionally incorporating patch neighborhood information. Results shown in Table 3 demonstrate that Pani MixMatch can further improve the performance of MixMatch in the standard semi-supervised setting, thus verifying the effectiveness and flexibility of our Patch-level Neighborhood Interpolation.

Methods CIFAR-10
PiModel [11]
PseudoLabel [13]
Mixup [30]
VAT [18]
MeanTeacher [24]
MixMatch [1]
Pani MixMatch
Table 3: Performance of our Pani MixMatch in the semi-supervised setting on CIFAR-10 with 4,000 labels. The reported results of MixMatch (ours) and Pani MixMatch are obtained under the same random seed and are the median over the last 20 epochs of training.

5 Discussion

The recent trend in the design of regularization attaches more importance to consistency and flexibility across various settings. For instance, Virtual Adversarial Training is a natural extension of Adversarial Training to the semi-supervised setting that constructs virtual adversarial smoothness. MixMatch unified the dominant approaches related to MixUp and achieved remarkable performance in the semi-supervised scenario by applying the MixUp operation to both labeled and unlabeled data. Along this line, we propose a general regularization motivated by the additional use of neighboring information existing in a sub-group of samples, e.g., within one batch, which can elegantly extend previous well-established regularization approaches and generalize well in both supervised and semi-supervised settings.

6 Conclusion

In this paper, we first analyze the benefit of leveraging non-i.i.d. information when developing more effective regularization for deep neural networks, and propose a general and flexible patch-neighbor regularizer, Patch-level Neighborhood Interpolation, which interpolates the neighborhood representation. Furthermore, we customize our Patch-level Neighborhood Interpolation into VAT and MixUp, respectively. Extensive experiments have verified the effectiveness of the two derived approaches, demonstrating the benefit of our Patch-level Neighborhood Interpolation. Our work paves the way toward better understanding and leveraging the relationship between samples to design better regularization and improve generalization over a wide range of settings.

Since the proposed Pani framework is general and flexible, more applications could be considered in the future, such as adversarial training for improving model robustness and natural language processing tasks. The theoretical properties of Pani should also be analyzed.