# Drop-Activation: Implicit Parameter Reduction and Harmonic Regularization

Overfitting frequently occurs in deep learning. In this paper, we propose a novel regularization method called Drop-Activation to reduce overfitting and improve generalization. The key idea is to drop nonlinear activation functions by setting them to be identity functions randomly during training time. During testing, we use a deterministic network with a new activation function to encode the average effect of dropping activations randomly. Experimental results on CIFAR-10, CIFAR-100, SVHN, and EMNIST show that Drop-Activation generally improves the performance of popular neural network architectures. Furthermore, unlike dropout, as a regularizer Drop-Activation can be used in harmony with standard training and regularization techniques such as Batch Normalization and AutoAug. Our theoretical analyses support the regularization effect of Drop-Activation as implicit parameter reduction and its capability to be used together with Batch Normalization.

## 1 Introduction

Convolutional neural networks (CNNs) are a powerful tool for computer vision tasks. With gradually increasing depth and width, CNNs [5, 6, 7, 22, 19] have achieved significant improvements in image classification by capturing multiscale features [24]. However, when the number of trainable parameters is far larger than the number of training samples, deep networks may suffer from overfitting. This has led to the routine use of regularization methods such as data augmentation [2], weight decay [10], Dropout [15] and Batch Normalization [9] to prevent overfitting and improve generalization.

Although regularization has become an essential part of deep learning, deciding which regularization methods to use remains an art. Even if each regularization method works well on its own, combining them does not always improve performance. For instance, a network trained with both Dropout and Batch Normalization may not produce a better result [9]. Dropout may change the statistical variance of a layer's output when we switch from training to testing, while Batch Normalization requires the variance to be the same during both stages [12].

### 1.1 Our contributions

To deal with the aforementioned challenges, we propose a novel regularization method, Drop-Activation, inspired by the works in [15, 3, 8, 21, 17], where some structures of networks are dropped to achieve better generalization. The advantages are as follows:

• Drop-Activation provides an easy-to-implement yet effective method for regularization via implicit parameter reduction.

• Drop-Activation can be used in synergy with most popular architectures and regularization methods, leading to improved performance on various datasets.

The basic idea of Drop-Activation is that the nonlinearities in the network are randomly activated or deactivated during training. More precisely, the nonlinear activations are turned into identity mappings with a certain probability, as shown in Figure 1. At testing time, we propose using a deterministic neural network with a new activation function, which is a convex combination of the identity mapping and the dropped nonlinearity, in order to represent the ensemble average of the random networks generated by Drop-Activation.

The starting point of Drop-Activation is to randomly ensemble a large class of neural networks with either an identity or a ReLU activation function. The training process of Drop-Activation identifies a set of parameters such that the various neural networks in this class work well when assigned these parameters, which prevents the overfitting of a fixed neural network. Drop-Activation can also be understood as adding noise to the training process for regularization. Indeed, our theoretical analysis will show that Drop-Activation implicitly adds a penalty term to the loss function, favoring network parameters such that the corresponding deep neural network can be approximated by a shallower neural network, i.e., implicit parameter reduction.

### 1.2 Organization

The remainder of this paper is structured as follows. In Section 2, we review some of the regularization methods and discuss their relations to our work. In Section 3, we formally introduce Drop-Activation. In Section 4, we demonstrate the regularization of Drop-Activation and its synergy with other regularization approaches on the datasets CIFAR-10, CIFAR-100, SVHN, and EMNIST. In Section 5, these advantages of Drop-Activation are further supported by our theoretical analyses.

## 2 Related work

Various regularization methods have been proposed to reduce the risk of overfitting. Data augmentation achieves regularization by directly enlarging the original training dataset via randomly transforming the input images [11, 14, 3, 2] or output labels [25, 18]. Another class of methods regularize the network by adding randomness into various neural network structures such as nodes [15], connections [17], pooling layers [23], activations [20] and residual blocks [4, 8, 21]. In particular [15, 3, 8, 21, 17] add randomness by dropping some structures of neural networks in training. We focus on reviewing this class of methods as they are most relevant to our method where the nonlinear activation functions are discarded randomly.

Dropout [15] drops nodes along with their connections with some fixed probability during training. DropConnect [17] has a similar idea but masks out some weights randomly. [8] improves the performance of ResNet [5] by dropping entire residual blocks at random during training and passing through skip connections (identity mappings). The randomness of dropping entire blocks enables us to train a shallower network in expectation. This idea is also used in [21] when training a ResNeXt [19] type 2-residual-branch network. The idea of dropping also arises in data augmentation. Cutout [3] randomly cuts out a square region of training images. In other words, it drops the input nodes in a patch-wise fashion, which prevents the neural network model from putting too much emphasis on a specific region of features.

In the next section, inspired by the above methods, we propose the Drop-Activation method for regularization. We want to emphasize that the improvement by Drop-Activation is universal to most neural-network architectures, and it can be readily used in conjunction with other regularizers without conflicts.

## 3 Drop-Activation

This section describes the Drop-Activation method. Suppose $x_0$ is an input vector of an $L$-layer feedforward network. Let $x_l$ be the output of the $l$-th layer. $f(\cdot)$ is the element-wise nonlinear activation operator that maps an input vector to an output vector by applying a nonlinearity $\sigma(\cdot)$ to each entry of the input. Without loss of generality, we assume $f: \mathbb{R}^d \to \mathbb{R}^d$, e.g.,

$$f(x) = \begin{bmatrix} \sigma(x[1]) \\ \vdots \\ \sigma(x[d]) \end{bmatrix} \in \mathbb{R}^d, \quad x = \begin{bmatrix} x[1] \\ \vdots \\ x[d] \end{bmatrix} \in \mathbb{R}^d, \tag{1}$$

where $\sigma$ could be a rectified linear unit (ReLU), a sigmoid or a tanh function. For a standard fully connected or convolutional network, the $d$-dimensional output can be written as

$$x_{l+1} = f(W_l x_l), \tag{2}$$

where $W_l$ is the weight matrix of the $l$-th layer. Biases are neglected for convenience of presentation.

In what follows, we modify the way of applying the nonlinear activation operator $f$ in order to achieve regularization. In the training phase, we remove the pointwise nonlinearities in $f$ randomly. In the testing phase, the function $f$ is replaced with a new deterministic one.

Training Phase: During training, the nonlinearities $\sigma$ in the operator $f$ are kept with probability $p$ (or dropped with probability $1-p$). The output of the $(l+1)$-th layer is thus

$$x_{l+1} = (I - P) W_l x_l + P f(W_l x_l) = (I - P + P f)(W_l x_l), \tag{3}$$

where $P = \mathrm{diag}(P_1, P_2, \dots, P_d)$, and $P_1, \dots, P_d$ are independent and identically distributed random variables following a Bernoulli distribution $B(p)$ that takes value $1$ with probability $p$ and $0$ with probability $1-p$. We use $I$ to denote the identity matrix. Intuitively, when $P = I$, then $x_{l+1} = f(W_l x_l)$, meaning all the nonlinearities in this layer are kept. When $P = 0$, then $x_{l+1} = W_l x_l$, meaning all the nonlinearities are dropped. The general case lies between these two limits, where the nonlinearities are kept or dropped partially. At each iteration, a new realization of $P$ is sampled from the Bernoulli distribution.

If the nonlinear activation function in Eqn. (3) is ReLU, the $j$-th component of $(I - P + Pf)(x)$ can be written as

$$(I - P + Pf)(x)[j] = \begin{cases} x[j], & x[j] \ge 0, \\ (1 - P_j)\, x[j], & x[j] < 0. \end{cases} \tag{4}$$

Testing Phase: During testing, we use a deterministic nonlinear function resulting from averaging the realizations of $P$. More precisely, we take the expectation of Eqn. (3) with respect to the random variable $P$:

$$x_{l+1} = \mathbb{E}_{P_i \sim B(p)} (I - P + P f)(W_l x_l) = \big((1-p) I + p f\big)(W_l x_l), \tag{5}$$

and the new activation function $(1-p)I + pf$ is the convex combination of an identity operator $I$ and an activation operator $f$. Eqn. (5) is the deterministic nonlinearity used to generate a deterministic neural network for testing. In particular, if ReLU is used, then the new activation is the Leaky ReLU with slope $1-p$ [20].
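The training and testing rules in Eqn. (3) to Eqn. (5) can be sketched in a few lines for the ReLU case. This is a minimal illustration in plain Python rather than the authors' implementation; the function names and the exhaustive enumeration in `expected_train_output` are assumptions introduced here to show that averaging the random training activation over all masks reproduces the deterministic testing activation.

```python
import random
from itertools import product


def relu_scalar(t):
    """Standard ReLU applied to one entry."""
    return max(0.0, t)


def drop_activation_train(v, p, rng):
    """Eqn. (3) with f = ReLU: each nonlinearity is kept with probability p."""
    out = []
    for t in v:
        keep = rng.random() < p  # P_j ~ Bernoulli(p)
        out.append(relu_scalar(t) if keep else t)
    return out


def drop_activation_test(v, p):
    """Eqn. (5) with f = ReLU: (1 - p) * identity + p * ReLU (leaky ReLU, slope 1 - p)."""
    return [t if t >= 0 else (1.0 - p) * t for t in v]


def expected_train_output(v, p):
    """Exact expectation of Eqn. (3), computed by enumerating all 2^d masks."""
    mean = [0.0] * len(v)
    for mask in product([0, 1], repeat=len(v)):
        prob = 1.0
        for m in mask:
            prob *= p if m == 1 else 1.0 - p
        for j, (m, t) in enumerate(zip(mask, v)):
            mean[j] += prob * (relu_scalar(t) if m == 1 else t)
    return mean


if __name__ == "__main__":
    pre_activation = [-2.0, 0.5, -0.1]
    print(drop_activation_train(pre_activation, 0.95, random.Random(0)))
    print(drop_activation_test(pre_activation, 0.95))
    print(expected_train_output(pre_activation, 0.95))
```

In a real network the mask would be resampled at every training iteration and for every layer, as described above; the enumeration is only feasible for tiny vectors and is included for checking, not for training.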

## 4 Experiments

In this section, we empirically evaluate the performance of Drop-Activation and demonstrate its effectiveness. We apply Drop-Activation to modern deep neural architectures such as ResNet [5], PreResNet [6], DenseNet [7], ResNeXt [19], and WideResNet [22] on a series of datasets including CIFAR-10, CIFAR-100 [10], SVHN [13] and EMNIST [1]. This section is organized as follows. Section 4.1 describes the basic experimental setting. In Section 4.2, we introduce the datasets and implementation details. In Section 4.3, we present the numerical results.

### 4.1 Experiment Design

Our experiments are to demonstrate the following points:

1. Comparison with RReLU: Due to the similarity between the activation function used in our proposed method and randomized leaky rectified linear units (RReLU) in Eqn. (4), one may speculate that the use of RReLU gives similar performance. We show that this is indeed not the case by comparing Drop-Activation with the use of RReLU.

2. Improvement to modern neural network architectures: We show the improvement that Drop-Activation brings is rather universal by applying it to different modern network architectures on a variety of datasets.

3. Compatibility with other approaches: We show that Drop-Activation is compatible with other popular regularization methods by combining them in different network architectures.

#### 4.1.1 Comparison with RReLU

Xu et al. proposed RReLU [20] with the following training scheme for an input vector $x$,

$$\mathrm{RReLU}(x)[j] = \begin{cases} x[j], & x[j] \ge 0, \\ U_j\, x[j], & x[j] < 0, \end{cases} \tag{6}$$

where $U_j$ is a random variable with a uniform distribution $U(l, u)$ with $0 \le l < u < 1$. In the case of ReLU in Drop-Activation, a comparison between Eqn. (4) and Eqn. (6) shows that the main difference between our approach and RReLU is the random variable used on the negative axis. It can be seen from Eqn. (6) that RReLU passes the negative data with a random shrinking rate, while Drop-Activation randomly lets the complete information pass. We compare Drop-Activation with RReLU using architectures like ResNet, PreResNet, and WideResNet on CIFAR-10 and CIFAR-100. The parameters $l$ and $u$ in RReLU are set at 1/8 and 1/3 respectively, as suggested in [20].
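The difference on the negative axis can be made concrete with a small side-by-side sketch. The code below is illustrative only: the function names are hypothetical, and the bounds 1/8 and 1/3 are the RReLU settings quoted above. RReLU always shrinks a negative entry by a factor drawn from a continuous range, whereas Drop-Activation with ReLU either zeroes the entry or passes it unchanged.

```python
import random


def rrelu_train(v, rng, lower=1 / 8, upper=1 / 3):
    """Eqn. (6): negative entries are scaled by U_j ~ Uniform(lower, upper)."""
    return [t if t >= 0 else rng.uniform(lower, upper) * t for t in v]


def drop_activation_relu_train(v, p, rng):
    """Eqn. (4): negative entries are scaled by 1 - P_j, which is either 0 or 1."""
    out = []
    for t in v:
        p_j = 1 if rng.random() < p else 0  # P_j ~ Bernoulli(p)
        out.append(t if t >= 0 else (1 - p_j) * t)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    negatives = [-1.0] * 5
    print(rrelu_train(negatives, rng))
    print(drop_activation_relu_train(negatives, 0.95, rng))
```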

#### 4.1.2 Improvement to modern neural network architectures

The residual-type neural network structures greatly facilitate the optimization of deep neural networks [5] and are employed by ResNet [5], PreResNet [6], DenseNet [7], ResNeXt [19], and WideResNet [22]. We demonstrate that Drop-Activation works well with these modern architectures. Moreover, since these networks use Batch Normalization to accelerate training and may contain Dropout to improve generalization (WideResNet), these experiments also show the ability of Drop-Activation to work in synergy with the prevalent training techniques. When applying Drop-Activation to these models, we directly substitute the original ReLU function with Eqn. (4) during training and with Leaky ReLU with slope $1-p$ during testing.

#### 4.1.3 Compatibility with other regularization approaches

To further show that Drop-Activation can cooperate well with other training techniques, we combine Drop-Activation with two other popular data augmentation approaches: Cutout [3] and AutoAugment [2]. Cutout randomly masks a square region of training data and AutoAugment uses reinforcement learning to obtain an improved data augmentation scheme. We implement Drop-Activation in combination with Cutout and AutoAugment on WideResNet and ResNet for CIFAR-100.

### 4.2 Datasets and implementation details

#### 4.2.1 Choosing probability of retaining activation:

In our method, the only parameter that needs to be tuned is the probability $p$ of retaining activation. To get a rough estimate of what $p$ should be, we train a simple network on CIFAR-10 and perform a grid search for $p$ on the interval , with a step size equal to . When $p = 1$ in Drop-Activation, the nonlinearity is just the standard ReLU. The simple network consists of the following layers: we first stack three blocks, each containing convolution with filter, Batch Normalization, ReLU, and average pooling. These are followed by two fully connected layers. Figure 2 shows the testing error on CIFAR-10 versus $p$, which is minimal at . Each data point is averaged over the outcomes of trained neural networks. Based on this observation, we choose $p = 0.95$ for all experiments.

CIFAR: Both CIFAR-10 and CIFAR-100 contain 60k natural color images of size 32 by 32. There are 50k images for training and 10k images for testing. CIFAR-10 has ten classes of objects with 6k images per class. CIFAR-100 is similar to CIFAR-10, except that it includes 100 classes with 600 images per class. Normalization and standard data augmentation (random cropping and horizontal flipping) are applied to the training data as in [5]. For CIFAR-10, we train the models ResNet-110, PreResNet-164, DenseNet-BC-100-12, DenseNet-BC-190-40, ResNeXt-864d, WideResNet-28-10. For CIFAR-100, the models that we train are the same as for CIFAR-10 except that ResNet-110 is replaced with ResNet-164 using the bottleneck layer in [6]. We use the same hyper-parameters as in the original papers except that the batch-size for DenseNet-BC-190-40 is set to . The models are optimized using SGD with momentum of [16].

SVHN: The Street View House Numbers (SVHN) dataset contains ten classes of color digit images of size 32 by 32. There are about 73k training images, 26k testing images, and 531k additional images. The training and additional images are used together for training, so there are over 600k images for training in total. An image in SVHN may contain more than one digit, and the recognition task is to identify the digit in the center of the image. We preprocess the images following [22]. The pixel values of the images are rescaled to , and no data augmentation is applied. For this dataset, we train the models WideResNet-16-8, DenseNet-BC-100-12, ResNeXt-864d. We train WideResNet-16-8 and DenseNet-BC-100-12 as in [22, 7]. For ResNeXt, we train it for 100 epochs, where the learning rate is initially set to 0.1 and decreases by a factor of 10 after the 40th and the 70th epoch. The rest of the hyper-parameters are set in the same way as in [19] when training ResNeXt on CIFAR-10.
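The ResNeXt learning-rate schedule described above for SVHN (initial rate 0.1, divided by 10 after the 40th and the 70th epoch) can be written as a small step function. The sketch below is an assumption about how one might encode it; the function name is hypothetical and epochs are taken to be 0-indexed, so epochs 0 to 39 are the first 40 epochs.

```python
def step_learning_rate(epoch, base_lr=0.1, milestones=(40, 70), factor=10.0):
    """Learning rate for a 0-indexed epoch under a step-decay schedule.

    The rate is divided by `factor` once for every milestone already passed.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


if __name__ == "__main__":
    for epoch in (0, 39, 40, 69, 70, 99):
        print(epoch, step_learning_rate(epoch))
```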

EMNIST: EMNIST is a set of grayscale images containing handwritten English characters and digits. There are six different splits in this dataset and we use the split Balanced. In Balanced, there are 131,600 images in total, including 112,800 for training and 18,800 for testing. There are 47 distinct classes. For this classification task, we train the models ResNet-164, PreResNet-164, WideResNet-20-10, DenseNet-BC-100-12, and ResNeXt-864d using the hyper-parameter settings for training CIFAR-100 in [5, 6, 22, 7, 19] respectively.

### 4.3 Experiment Result

Tables 1, 2, 3, and 4 show the testing error on CIFAR-100, CIFAR-10, SVHN and EMNIST, respectively. The baseline results are from the original networks without Drop-Activation. In what follows, we discuss how our results support the points raised in Section 4.1.

#### 4.3.1 Comparison with RReLU

As shown in Table 1 and Table 2, RReLU may perform worse than the baseline method, e.g., in the case of WideResNet. In contrast, Drop-Activation consistently outperforms RReLU and almost all baseline methods. Although Drop-Activation cannot reduce the testing error of ResNeXt-864d, Drop-Activation with DenseNet-BC-190-40 achieves the best testing error, which is smaller than that of the original ResNeXt-864d.

#### 4.3.2 Application to modern models:

As shown in Tables 1 and 3, Drop-Activation consistently improves the testing accuracy compared to the baseline for CIFAR-100 and SVHN. The conclusion remains the same in Tables 2 and 4 for CIFAR-10 and EMNIST, except for one case in CIFAR-10 and one case in EMNIST, where the magnitude of deterioration is relatively small. In particular, Drop-Activation improves ResNet, PreResNet and WideResNet by reducing the relative test error for CIFAR-10, CIFAR-100 or SVHN by over 3.5%.

Therefore, Drop-Activation can work with most modern networks for different datasets. Besides, our results implicitly show that Drop-Activation is compatible with regularization techniques such as Batch Normalization or Dropout used in training these networks.

#### 4.3.3 Compatibility with other regularization approaches:

We apply Drop-Activation to network models that use Cutout or AutoAugment. As shown in Tables 5 and 6, Drop-Activation can further improve ResNet-18 and WideResNet-20-10 with Cutout or AutoAugment by decreasing the test error by over 0.5%. To the best of our knowledge, AutoAugment achieves the state-of-the-art result on CIFAR-100 using PyramidNet+ShakeDrop [21]. Due to limited computing resources, the models with PyramidNet+ShakeDrop+DropAct and other possible combinations are still under training.

## 5 Theoretical Analysis

In Section 5.1, we show that in a neural-network with one-hidden-layer, Drop-Activation provides a regularization to the network via penalizing the difference between a deep and shallow network, which can be understood as implicit parameter reduction, i.e., the intrinsic dimension of the parameter space is smaller than the original parameter space. In Section 5.2, we further show that the use of Drop-Activation does not impact some other techniques such as Batch Normalization, which ensures the practicality of using Drop-Activation.

### 5.1 Drop-Activation as a regularizer

In this section, we show that having Drop-Activation in a standard one-hidden layer fully connected neural network with ReLU activation gives rise to an explicit regularizer.

Let $x$ be the input vector and $y$ be the target output. The output of the one-hidden-layer neural network with ReLU activation is $\hat{y} = W_2 r(W_1 x)$, where $W_1$, $W_2$ are the weights of the network and $r$ is the function applying ReLU elementwise to the input vector. Let $r_p$ denote the leaky ReLU with slope $1-p$ in the negative part.

As in Eqn. (3) and (5), applying Drop-Activation to this network gives

$$\hat{y} = W_2 \big( (I - P + P r) W_1 x \big) \tag{7}$$

during training, and

$$\hat{y} = W_2 \big( (1-p) I + p r \big) W_1 x = W_2 r_p(W_1 x) \tag{8}$$

during testing.

Suppose we have $n$ training samples $\{(x_i, y_i)\}_{i=1}^{n}$. To reveal the effect of Drop-Activation, we average the training loss function over $P$:

$$\min_{W_1, W_2} \sum_{i=1}^{n} \mathbb{E}\, \big\| W_2 [ (I - P + P r) W_1 x_i ] - y_i \big\|_2^2, \tag{9}$$

where the expectation is taken with respect to the feature noise $P$. The use of Drop-Activation can be seen as applying a stochastic minimization to such an average loss. The result after averaging the loss function over $P$ is summarized as follows.

###### Property 5.1

The optimization problem (9) is equivalent to

$$\min_{W_1, W_2} \sum_{i=1}^{n} \big\| W_2 r_p(W_1 x_i) - y_i \big\|_2^2 + \frac{1-p}{p} \big\| W_2 W_1 x_i - W_2 r_p(W_1 x_i) \big\|_2^2. \tag{10}$$

Proof of Property 5.1 can be found in the Supplementary Material. The first term is nothing but the loss during prediction time $\sum_{i=1}^{n} \|\hat{y}_i - y_i\|_2^2$, where the $\hat{y}_i$'s are defined via (8). Therefore, Property 5.1 shows that Drop-Activation incurs a penalty

$$\frac{1-p}{p} \big\| W_2 W_1 x_i - W_2 r_p(W_1 x_i) \big\|_2^2 \tag{11}$$

on top of the prediction loss. In Eqn. (11), the coefficient $\frac{1-p}{p}$ influences the magnitude of the penalty. In our experiments, $p$ is selected to be a large number that is close to $1$ (typically $0.95$), resulting in a rather small regularization.

The penalty (11) consists of the terms $W_2 W_1 x_i$ and $W_2 r_p(W_1 x_i)$. Since $W_2 W_1 x_i$ has no nonlinearity, it can be viewed as a shallow network. In contrast, since $W_2 r_p(W_1 x_i)$ has the nonlinearity $r_p$, it can be considered as a deep network. The two networks share the same parameters $W_1$ and $W_2$. Therefore the penalty (11) encourages weights such that the prediction of the relatively deep network is somewhat close to that of a shallow network. In a classification or regression task, the shallow network has less representation power. However, the lower parameter complexity of the shallow network results in mappings with better generalization properties. In this way, the penalty incurred by Drop-Activation may help in reducing overfitting by implicit parameter reduction.
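To make the penalty (11) tangible, the following sketch evaluates it for a tiny one-hidden-layer network. It is an illustration under stated assumptions rather than part of the paper's experiments: the helper names are hypothetical, the weights are plain nested lists, and the example weights are made up for demonstration.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * t for w, t in zip(row, x)) for row in W]


def leaky_relu(v, slope):
    """Leaky ReLU r_p with the given slope on the negative part."""
    return [t if t >= 0 else slope * t for t in v]


def drop_activation_penalty(W1, W2, x, p):
    """Penalty (11): (1 - p) / p * ||W2 W1 x - W2 r_p(W1 x)||_2^2."""
    hidden = matvec(W1, x)
    shallow = matvec(W2, hidden)                     # linear (shallow) network
    deep = matvec(W2, leaky_relu(hidden, 1.0 - p))   # network with r_p, slope 1 - p
    sq_dist = sum((s - t) ** 2 for s, t in zip(shallow, deep))
    return (1.0 - p) / p * sq_dist


if __name__ == "__main__":
    W1 = [[1.0, 0.0], [0.0, 1.0]]  # illustrative weights only
    W2 = [[1.0, 1.0]]
    print(drop_activation_penalty(W1, W2, [1.0, -2.0], 0.95))
```

The penalty vanishes whenever all hidden pre-activations are non-negative, since the shallow and deep networks then produce the same output, and it also vanishes at $p = 1$, where Drop-Activation reduces to the standard ReLU network.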

To illustrate this point, we perform a simple regression task for two functions. In Figure 3(a), the ground truth function (blue) is $f(x) = x \sin x$. To generate the training dataset, we sample 20 $x_i$'s on the interval , and let

$$y_i = x_i \sin x_i + \epsilon, \quad \epsilon \sim N(0, 1), \quad i = 1, \dots, 20. \tag{12}$$

Then we train a fully connected network with three hidden layers of width 1000, 800, 200, respectively. As shown in Figure 3(a), the network with normal ReLU has a low prediction error on the training data points, but is generally erroneous in other regions. Although the network with Drop-Activation does not fit the training data as well (compared with using normal ReLU), overall it achieves a better prediction error. In Figure 3(b), we show the regression results for the piecewise constant function, which can be viewed as a one-dimensional classification problem. We again see that the network using normal ReLU has a large test error near the left and right boundaries, where there are fewer training data points. However, with the incurred penalty (11), the network with Drop-Activation yields a smooth curve. Furthermore, Drop-Activation reduces the influence of data noise.
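The data-generation step in Eqn. (12) can be sketched as follows. The sampling interval is not recoverable from the text, so the default `low` and `high` below are placeholders chosen only for illustration, as are the function name and the seed; only the sample count of 20, the target $x \sin x$ and the unit-variance Gaussian noise come from the description above.

```python
import math
import random


def make_regression_data(n=20, low=0.0, high=10.0, noise_std=1.0, seed=0):
    """Sample n points x_i and noisy targets y_i = x_i * sin(x_i) + eps.

    `low` and `high` are assumed placeholders, not values from the paper.
    """
    rng = random.Random(seed)
    xs = sorted(rng.uniform(low, high) for _ in range(n))
    ys = [x * math.sin(x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_regression_data()
    for x, y in zip(xs, ys):
        print(f"{x:.3f}\t{y:.3f}")
```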

In another experiment, we train ResNet-164 on CIFAR-100 to demonstrate the regularization property of Drop-Activation. In Figure 4, the training error with Drop-Activation is slightly larger than that without Drop-Activation. However, in terms of generalization error, Drop-Activation gives improved performance. This verifies that the original network has been over-parameterized and that Drop-Activation is able to regularize the network by implicit parameter reduction.

### 5.2 Compatibility of Drop-Activation with Batch Normalization

In this section, we show theoretically that Drop-Activation essentially keeps the statistical property of the output of each network layer when going from training to testing phase and hence it can be used together with Batch Normalization. [12] argues that Batch Normalization assumes the output of each layer has the same variance during training and testing. However, dropout will shift the variance of the output during testing time leading to disharmony when used in conjunction with Batch Normalization. Using a similar analysis as [12], we show that unlike dropout, Drop-Activation can be used together with Batch-Normalization since it maintains the output variance.

To this end, we analyze the mappings in ResNet [5]. Figure 5 (Left) shows a basic block of ResNet while Figure 5 (Right) shows a basic block with Drop-Activation. We focus on the rectangular box with the dashed line. Suppose the output from the Batch Normalization layer shown in Figure 5 is $x = (x[1], \dots, x[d])$, where $x[i] \sim N(0, 1)$ are i.i.d. random variables. When $x$ is passed to the Drop-Activation layer followed by a linear transformation with weights $w = (w_1, \dots, w_d)$, we obtain

$$X_{\text{train}} := \sum_{i=1}^{d} w_i \big( (1 - P_i)\, x[i] + P_i\, r(x[i]) \big), \tag{13}$$

where $P_i \sim B(p)$ and $r$ denotes the ReLU function. Similarly, during testing, taking the expectation of (13) over the $P_i$'s gives

$$X_{\text{test}} := \sum_{i=1}^{d} w_i \big( (1 - p)\, x[i] + p\, r(x[i]) \big). \tag{14}$$

The output of the rectangular box, $X_{\text{train}}$ (and $X_{\text{test}}$ during testing), is then used as the input to the next Batch Normalization layer in Figure 5. Since for Batch Normalization we only need to understand the entry-wise statistics of its input, without loss of generality, we assume the linear transformation maps a vector from $\mathbb{R}^d$ to $\mathbb{R}$, so that $X_{\text{train}}$ and $X_{\text{test}}$ are scalars.

We want to show that $X_{\text{train}}$ and $X_{\text{test}}$ have similar statistics. By design, $\mathbb{E} X_{\text{train}} = \mathbb{E} X_{\text{test}}$. Notice that the expectation here is taken with respect to both the random variables $P_i$ and the input $x$ of the box in Figure 5. Thus the main question is whether the variances of $X_{\text{train}}$ and $X_{\text{test}}$ are the same. To this end, we introduce the shift ratio [12]:

$$\text{Shift ratio} = \frac{\mathrm{Var}(X_{\text{test}})}{\mathrm{Var}(X_{\text{train}})}$$

as a metric for evaluating the variance shift. The shift ratio is expected to be close to $1$, since the Batch Normalization layer requires its input to have similar variance at both training and testing time.

###### Property 5.2

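Let $X_{\text{train}}$ and $X_{\text{test}}$ be defined in Eqn. (13) and Eqn. (14). Then the shift ratio of $X_{\text{train}}$ and $X_{\text{test}}$ is

$$\frac{\mathrm{Var}(X_{\text{test}})}{\mathrm{Var}(X_{\text{train}})} = \frac{\left(\frac{1}{2} - \frac{1}{2\pi}\right) p^2 - p + 1}{1 - \frac{1}{2} p - \frac{1}{2\pi} p^2}. \tag{15}$$

The proof of Property 5.2 is provided in the Supplementary Material. In Eqn. (15), the range of the shift ratio lies on the interval . In particular, for $p = 0.95$, the value used in the numerical experiment that follows, Eqn. (15) gives a shift ratio of about $0.94$, which is close to $1$. This shows that in Drop-Activation, the difference in the variance of inputs to a Batch Normalization layer between the training and testing phase is rather minor.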
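Eqn. (15) is easy to evaluate numerically. The sketch below simply encodes the numerator and denominator of Eqn. (15) as per-unit variances (the common factor $\sum_i w_i^2$ cancels in the ratio) and scans a grid of $p$ values; the function names are introduced here for illustration.

```python
import math


def var_train_per_unit(p):
    """Denominator of Eqn. (15): 1 - p/2 - p^2 / (2*pi)."""
    return 1.0 - 0.5 * p - p * p / (2.0 * math.pi)


def var_test_per_unit(p):
    """Numerator of Eqn. (15): (1/2 - 1/(2*pi)) * p^2 - p + 1."""
    return (0.5 - 1.0 / (2.0 * math.pi)) * p * p - p + 1.0


def shift_ratio(p):
    """Shift ratio Var(X_test) / Var(X_train) from Eqn. (15)."""
    return var_test_per_unit(p) / var_train_per_unit(p)


if __name__ == "__main__":
    for k in range(0, 21):
        p = k / 20.0
        print(f"p = {p:.2f}  shift ratio = {shift_ratio(p):.4f}")
```

At both endpoints $p = 0$ (no nonlinearity) and $p = 1$ (standard ReLU) the training and testing networks coincide, so the ratio equals $1$ there and dips below $1$ in between.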

We further demonstrate numerically that Drop-Activation does not generate an enormous shift in the variance of the internal covariates when going from training time to testing time. We train ResNet-164 on CIFAR-100 and let the probability of retaining activation be 0.95 in Drop-Activation. ResNet-164 consists of a stack of three modules. Each module contains 54 convolution layers but has a different number of channels. We observe the statistics of the output of the second module by evaluating its shift ratio. We compute the variance of the output for each channel and then average the channels' variances. As shown in Figure 6, the shift ratio stabilizes at  by the end of training.
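The measurement procedure described above (per-channel variance, averaged over channels, then the test-to-train ratio) can be sketched as below. This is an assumed, framework-free formulation: each channel is represented as a flat list of recorded activations, and the function names are hypothetical.

```python
from statistics import pvariance


def mean_channel_variance(channels):
    """Average of the per-channel population variances.

    `channels` is a list of lists; each inner list holds all recorded
    activations of one channel.
    """
    return sum(pvariance(c) for c in channels) / len(channels)


def empirical_shift_ratio(test_channels, train_channels):
    """Ratio of averaged channel variances, testing mode over training mode."""
    return mean_channel_variance(test_channels) / mean_channel_variance(train_channels)


if __name__ == "__main__":
    train = [[0.0, 2.0, 4.0], [1.0, 1.5, 2.0]]  # toy numbers for illustration only
    test = [[0.0, 1.9, 3.8], [1.0, 1.5, 2.0]]
    print(empirical_shift_ratio(test, train))
```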

In summary, by keeping the statistical property of the internal output of hidden layers, Drop-Activation can be combined with Batch Normalization to improve performance.

## 6 Conclusion

In this paper, we propose Drop-Activation, a regularization method that introduces randomness on the activation function. Drop-Activation works by randomly dropping the nonlinear activations in the network during training and uses a deterministic network with modified nonlinearities for prediction.

The advantage of the proposed method is two-fold. Firstly, Drop-Activation provides a simple yet effective method for regularization, as demonstrated by the numerical experiments. Furthermore, this is supported by our analysis in the case of one hidden layer. We show that Drop-Activation gives rise to a regularizer that penalizes the difference between nonlinear and linear networks. Future directions include the analysis of Drop-Activation with more than one hidden layer. Secondly, experiments verify that Drop-Activation improves the generalization of most modern neural networks and cooperates well with some other popular training techniques. Moreover, we show theoretically and numerically that Drop-Activation maintains the variance during both training and testing times, and thus Drop-Activation can work well with Batch Normalization. These two properties should allow wide application of Drop-Activation in many network architectures.

Acknowledgments. H. Yang thanks the support of the start-up grant by the Department of Mathematics at the National University of Singapore. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

## 7 Appendix

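Suppose that $x$ is the input vector. Let $D_{W_1,x} = \mathrm{diag}(d)$, where $d$ is a 0-1 vector whose $j$-th component is equal to 1 if the $j$-th component of $W_1 x$ is positive and 0 otherwise. Then $r(W_1 x) = D_{W_1,x} W_1 x$. For simplification, denote

$$S := I - P + P D_{W_1,x}, \quad S_p := I - pI + p D_{W_1,x}, \quad v := W_1 x.$$

On one hand, $W_2 r_p(W_1 x) = W_2 S_p W_1 x$. We expand the squared error and obtain

$$\| W_2 S_p W_1 x - y \|_2^2 = \mathrm{tr}(W_2 S_p v v^T S_p W_2^T) - 2\, \mathrm{tr}(W_2 S_p v y^T) + \mathrm{tr}(y y^T), \tag{16}$$

where $\mathrm{tr}$ is the trace operator computing the sum of the diagonal entries of a matrix. The function $\mathrm{vec}$ converts a diagonal matrix into a column vector. Then we transform the first term of Eqn. (16) and get

$$\begin{aligned} \mathrm{tr}(W_2 S_p v v^T S_p W_2^T) &= \mathrm{tr}(S_p v v^T S_p W_2^T W_2) \\ &= \mathrm{tr}\big(\mathrm{diag}(v)\, \mathrm{vec}(S_p)\, \mathrm{vec}(S_p)^T \mathrm{diag}(v)\, W_2^T W_2\big) \\ &= \mathrm{tr}\big(\mathrm{vec}(S_p)\, \mathrm{vec}(S_p)^T \mathrm{diag}(v)\, W_2^T W_2\, \mathrm{diag}(v)\big). \end{aligned} \tag{17}$$

On the other hand, we have

$$\mathbb{E} \| W_2 [ (I - P + P r) W_1 x ] - y \|_2^2 = \mathbb{E}\big[ \| W_2 S v - y \|_2^2 \big] = \mathbb{E}\big[ \mathrm{tr}(W_2 S v v^T S W_2^T) \big] - 2\, \mathrm{tr}(W_2 S_p v y^T) + \mathrm{tr}(y y^T), \tag{18}$$

where the expectation is taken with respect to the feature noise $P$. Similar to Eqn. (17), we have

$$\mathrm{tr}(W_2 S v v^T S W_2^T) = \mathrm{tr}\big(\mathrm{vec}(S)\, \mathrm{vec}(S)^T \mathrm{diag}(v)\, W_2^T W_2\, \mathrm{diag}(v)\big). \tag{19}$$

Since $\mathrm{tr}$ is linear, taking the expectation of Eqn. (19) with respect to $P$ gives

$$\mathbb{E}\, \mathrm{tr}(W_2 S v v^T S W_2^T) = \mathrm{tr}\big(\mathbb{E}(\mathrm{vec}(S)\, \mathrm{vec}(S)^T)\, \mathrm{diag}(v)\, W_2^T W_2\, \mathrm{diag}(v)\big). \tag{20}$$

Denote by $d_i$ the $i$-th component of $d$; then

$$\mathbb{E}\big[\mathrm{vec}(S)\, \mathrm{vec}(S)^T\big] - \mathrm{vec}(S_p)\, \mathrm{vec}(S_p)^T = \mathrm{diag}\big( \mathbb{E}((1 - P_i + P_i d_i)^2) - (1 - p + p d_i)^2 \big) = p(1-p)(I - D_{W_1,x})^2. \tag{21}$$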
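The last equality in Eqn. (21) can be checked entry by entry. The short derivation below is an added worked step; it uses only $P_i^2 = P_i$ and $\mathbb{E} P_i = p$ for a Bernoulli variable, and the off-diagonal entries vanish because $P_i$ and $P_j$ are independent for $i \neq j$.

```latex
\begin{aligned}
\mathbb{E}\big[(1 - P_i + P_i d_i)^2\big]
  &= \mathbb{E}\big[(1 - P_i (1 - d_i))^2\big]
   = 1 - 2p(1 - d_i) + p(1 - d_i)^2, \\
(1 - p + p d_i)^2
  &= (1 - p(1 - d_i))^2
   = 1 - 2p(1 - d_i) + p^2 (1 - d_i)^2, \\
\mathbb{E}\big[(1 - P_i + P_i d_i)^2\big] - (1 - p + p d_i)^2
  &= p(1 - p)(1 - d_i)^2 .
\end{aligned}
```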

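Then, using Eqn. (21), Eqn. (17), and Eqn. (19), we get the difference between Eqn. (16) and Eqn. (18), that is,

$$\begin{aligned} \mathbb{E}\big[\mathrm{tr}(W_2 S v v^T S W_2^T)\big] - \mathrm{tr}(W_2 S_p v v^T S_p W_2^T) &= \mathrm{tr}\big\{ \big( \mathbb{E}(\mathrm{vec}(S)\, \mathrm{vec}(S)^T) - \mathrm{vec}(S_p)\, \mathrm{vec}(S_p)^T \big)\, \mathrm{diag}(v)\, W_2^T W_2\, \mathrm{diag}(v) \big\} \\ &= p(1-p)\, \mathrm{tr}\big\{ (I - D_{W_1,x})^2\, \mathrm{diag}(v)\, W_2^T W_2\, \mathrm{diag}(v) \big\} \\ &= p(1-p)\, \mathrm{tr}\big\{ W_2\, \mathrm{diag}(v)\, (I - D_{W_1,x})^2\, \mathrm{diag}(v)\, W_2^T \big\} \\ &= p(1-p)\, \| W_2 (I - D_{W_1,x}) W_1 x \|_2^2. \end{aligned}$$

Note that $I - S_p = p(I - D_{W_1,x})$; then we get

$$p(1-p)\, \| W_2 (I - D_{W_1,x}) W_1 x \|_2^2 = \frac{1-p}{p} \| W_2 (I - S_p) W_1 x \|_2^2 = \frac{1-p}{p} \| W_2 W_1 x - W_2 r_p(W_1 x) \|_2^2.$$

Finally, we obtain the difference between Eqn. (16) and Eqn. (18),

$$\frac{1-p}{p} \| W_2 W_1 x - W_2 r_p(W_1 x) \|_2^2.$$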

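### 7.1 Proof of Property 5.2

Since $x[i] \sim N(0, 1)$, it is easy to get

$$\mathbb{E}(x[i]) = 0, \quad \mathbb{E}(r(x[i])) = \frac{1}{\sqrt{2\pi}},$$

and

$$\mathbb{E}(x[i]^2) = 1, \quad \mathbb{E}(r(x[i])^2) = \frac{1}{2},$$

where the expectation is taken with respect to the random variable $x[i]$. We know that

$$\mathbb{E}(X_{\text{train}}) = \sum_{i=1}^{d} w_i\, \mathbb{E}\big( (1 - P_i)\, x[i] + P_i\, r(x[i]) \big) = \frac{p \sum_{i=1}^{d} w_i}{\sqrt{2\pi}}, \quad \mathbb{E}(X_{\text{test}}) = \sum_{i=1}^{d} w_i\, \mathbb{E}\big( (1 - p)\, x[i] + p\, r(x[i]) \big) = \frac{p \sum_{i=1}^{d} w_i}{\sqrt{2\pi}},$$

where we take the expectation with respect to the feature noise $P_i$ and the inputs $x[i]$. That means $\mathbb{E}(X_{\text{train}}) = \mathbb{E}(X_{\text{test}})$. In what follows, we compute $\mathrm{Var}(X_{\text{train}})$ and $\mathrm{Var}(X_{\text{test}})$.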
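The Gaussian moments used above, together with the cross moment $\mathbb{E}(x[i]\, r(x[i]))$ that appears when the squares are expanded, follow from standard integrals. The block below is an added worked step, writing $\varphi$ for the standard normal density and $t$ for a generic entry $x[i]$.

```latex
\begin{aligned}
\mathbb{E}\big(r(t)\big)
  &= \int_0^\infty t\, \varphi(t)\, dt
   = \frac{1}{\sqrt{2\pi}} \Big[ -e^{-t^2/2} \Big]_0^\infty
   = \frac{1}{\sqrt{2\pi}}, \\
\mathbb{E}\big(r(t)^2\big)
  &= \int_0^\infty t^2 \varphi(t)\, dt
   = \frac{1}{2} \int_{-\infty}^\infty t^2 \varphi(t)\, dt
   = \frac{1}{2}, \\
\mathbb{E}\big(t\, r(t)\big)
  &= \mathbb{E}\big(r(t)^2\big) = \frac{1}{2},
  \qquad \text{since } t\, r(t) = r(t)^2 .
\end{aligned}
```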

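Since

$$X_{\text{train}}^2 = \sum_{i=1}^{d} w_i^2 \big( (1 - P_i)\, x[i] + P_i\, r(x[i]) \big)^2 + 2 \sum_{i<j} w_i w_j \big( (1 - P_i)\, x[i] + P_i\, r(x[i]) \big) \big( (1 - P_j)\, x[j] + P_j\, r(x[j]) \big),$$

we take the expectation and, using the independence of the summands for $i \neq j$, obtain

$$\begin{aligned} \mathbb{E}(X_{\text{train}}^2) &= \sum_{i=1}^{d} w_i^2\, \mathbb{E}\big( (1 - P_i)^2 x[i]^2 + 2 (1 - P_i) P_i\, x[i]\, r(x[i]) + P_i^2\, r(x[i])^2 \big) + 2 \sum_{i<j} w_i w_j \frac{p^2}{2\pi} \\ &= \sum_{i=1}^{d} w_i^2 \Big( 1 - p + \frac{1}{2} p \Big) + \frac{p^2}{\pi} \sum_{i<j} w_i w_j. \end{aligned}$$

Therefore,

$$\mathrm{Var}(X_{\text{train}}) = \sum_{i=1}^{d} w_i^2 \Big( 1 - p + \frac{1}{2} p \Big) + \frac{p^2}{\pi} \sum_{i<j} w_i w_j - \frac{p^2}{2\pi} \Big( \sum_{i=1}^{d} w_i \Big)^2 = \sum_{i=1}^{d} w_i^2 \Big( 1 - \frac{1}{2} p - \frac{1}{2\pi} p^2 \Big).$$

This completes the computation of $\mathrm{Var}(X_{\text{train}})$. Next we compute $\mathrm{Var}(X_{\text{test}})$.

$$X_{\text{test}}^2 = \sum_{i=1}^{d} w_i^2 \big( (1 - p)\, x[i] + p\, r(x[i]) \big)^2 + 2 \sum_{i<j} w_i w_j \big( (1 - p)\, x[i] + p\, r(x[i]) \big) \big( (1 - p)\, x[j] + p\, r(x[j]) \big).$$

We take the expectation with respect to the input $x$,

$$\begin{aligned} \mathbb{E}(X_{\text{test}}^2) &= \sum_{i=1}^{d} w_i^2\, \mathbb{E}\big( (1 - p)^2 x[i]^2 + 2 (1 - p) p\, x[i]\, r(x[i]) + p^2\, r(x[i])^2 \big) + 2 \sum_{i<j} w_i w_j \frac{p^2}{2\pi} \\ &= \sum_{i=1}^{d} w_i^2 \Big( 1 - p + \frac{1}{2} p^2 \Big) + \frac{p^2}{\pi} \sum_{i<j} w_i w_j, \end{aligned}$$

and $\big(\mathbb{E}(X_{\text{test}})\big)^2 = \frac{p^2}{2\pi} \big( \sum_{i=1}^{d} w_i \big)^2$. Therefore,

$$\mathrm{Var}(X_{\text{test}}) = \sum_{i=1}^{d} w_i^2 \Big( \Big( \frac{1}{2} - \frac{1}{2\pi} \Big) p^2 - p + 1 \Big).$$

So we have

$$\frac{\mathrm{Var}(X_{\text{test}})}{\mathrm{Var}(X_{\text{train}})} = \frac{\left( \frac{1}{2} - \frac{1}{2\pi} \right) p^2 - p + 1}{1 - \frac{1}{2} p - \frac{1}{2\pi} p^2}.$$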