feizi2013network_reference_environment
Reference environment for ['Network deconvolution as a general method to distinguish direct dependencies in networks'](https://dx.doi.org/10.1038/nbt.2635)
Convolution is a central operation in Convolutional Neural Networks (CNNs), which applies a kernel or mask to overlapping regions shifted across the image. In this work we show that the underlying kernels are trained with highly correlated data, which leads to co-adaptation of model weights. To address this issue we propose what we call network deconvolution, a procedure that aims to remove pixel-wise and channel-wise correlations before the data is fed into each layer. We show that by removing this correlation we are able to achieve better convergence rates during model training with superior results without the use of batch normalization on the CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST datasets, as well as against reference models from "model zoo" on the ImageNet standard benchmark.
Images of natural scenes depict homogeneous color and texture regions delineated by edges, and adjacent pixels are statistically highly correlated Olshausen and Field [1996], Hyvärinen et al. [2009]. We can imagine that this correlation is somehow induced by an external process: in the same way correlations are introduced when a Gaussian kernel blurs an input image, we can think of the natural images themselves as being blurred and correlated by some unknown operator, which we call the tangling by nature (Figure 1). This tangling complicates object recognition tasks as adjacent pixels contain highly redundant information, and convolutional networks endure the cost of processing this information without accruing substantial benefits. As applying a Gaussian blur complicates human recognition (leading to the need for corrective lenses), the tangling effect in natural images may also complicate machine learning in neural networks, requiring its own method of correcting the image, a method we call network deconvolution.
Convolutional Neural Networks (CNNs) LeCun et al. [1998], the most widely used networks in visual learning, have demonstrated superior performance in a variety of tasks ranging from Computer Vision Hinton et al. [2012], over Natural Language Processing Collobert et al. [2011], Mnih et al. [2015], owing to their ability to learn their own convolutional kernels that extract meaningful features from the input.
In CNNs, resulting from a combination of the tangling effect and the fact that the kernel shifts only slightly between each receptive field, layers of the neural network are in reality re-learning much of the same information over and over, a factor that we believe slows down learning. We refer to the correlation between the raw pixels in a single image or feature map as the pixel-wise correlation. Similarly, in the case of different channels of a hidden layer of the network, there is a strong correlation or "cross-talk" between these channels because these channels are provided with statistically very similar inputs; we refer to this as channel-wise correlation. The goal of this paper is to remove redundant features that will not lead to effective learning. In this paper we propose two methods for attacking this problem: Full deconvolution is the removal of both pixel-wise and channel-wise correlations, whereas channel deconvolution is the removal of just channel-wise correlations. We define network deconvolution as the application of full deconvolution to the input at every layer of a network.
Our contributions are twofold:
We introduce a pixel-wise decorrelation at each layer of the network, which we call network deconvolution, that aims to ensure that at every step only sparse and discriminative data is used for learning.
We demonstrate consistently superior accuracy of our approach on the CIFAR-10, CIFAR-100, Fashion-MNIST, and ImageNet datasets compared to batch normalization, showing that deconvolution can be used as a generic procedure in a variety of architectures.
Since its introduction, batch normalization has been the main normalization technique Ioffe and Szegedy [2015] to facilitate the training of deep networks using stochastic gradient descent (SGD). Many techniques have been introduced to address cases for which batch normalization does not perform well. These include training with a small batch size Wu and He [2018], and on recurrent networks Salimans and Kingma [2016]. However, to our knowledge, none of these methods has demonstrated superior performance on the ImageNet dataset.
In the signal processing community, our network deconvolution is what would be referred to as a whitening mechanism. There have been previous attempts to whiten the feature channels and to utilize second-order information. For example, in Martens and Grosse [2015], Desjardins et al. [2015] the authors approximate second-order information using the Fisher information matrix. However, we found that implementations of this form are unstable for deep networks, a fact that was already noted by the authors of Desjardins et al. [2015]. One reason is that these methods directly compute a matrix inversion and that is excluded from the back propagation. We point out in Eq. 4 that there is a first-order correction term that needs to be included to ensure accurate computation of gradients. More importantly, previous methods only addressed the correlation of different channels (that is, across features), but did not consider correlation between nearby pixels (within a single feature). This is critical, because kernels need to be trained from the tiny discrepancies between the highly correlated neighboring pixels.
In this section we describe the deconvolution operation, and give some intuition for why it helps the network learn. The task of the deconvolution operation is to stretch a vector of random variables such that each random variable is independent and identically distributed (i.i.d.) in the statistical sense. The goal of this is to remove the correlation that a variable has with another variable. We will cover two types of deconvolution, pixel and channel deconvolution. Network deconvolution is thus the application of these at each layer of the neural network.
Given a batch of $N$ vectors $X$, organized as rows of column vectors, we calculate their covariance matrix $Cov = \frac{1}{N} X^T X$. Here $X$ is the centered matrix, such that the mean of each column is $0$. We then calculate an approximated inverse square root of the covariance matrix, $D = Cov^{-\frac{1}{2}}$, and multiply this with the centered vectors, $X \cdot D$; this decorrelates each dimension (pixel, channel) from other dimensions. In a sense, we are deconvolving the correlation originating from the statistics of the raw image or resulting from the training process. The untangling operator is this inverse square root of the covariance matrix. If computed perfectly, this would turn the covariance matrix of the transformed data, $D^T \cdot Cov \cdot D$, into the identity matrix $I$.
When dealing with multiple feature channels in a convolution layer, the data matrix is constructed by flattening (large) image patches in each channel into columns and then concatenating these columns. As noted previously, there are two types of correlation that affect the learning of weights: one is the cross-channel correlation, the other is the more prominent autocorrelation between nearby pixels.
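The whitening step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it computes the inverse square root exactly through an eigendecomposition rather than iteratively, and the function name `deconvolve`, the stabilizer `eps`, and the toy mixing matrix are assumptions made for the example.

```python
import numpy as np

def deconvolve(X, eps=1e-5):
    """Whiten the columns of X (N samples x F features).

    Returns the transformed data and the operator D = (Cov + eps*I)^(-1/2).
    """
    Xc = X - X.mean(axis=0, keepdims=True)       # center every column
    cov = Xc.T @ Xc / Xc.shape[0]                # F x F covariance matrix
    # Exact inverse square root via eigendecomposition (assumed stand-in
    # for an iterative approximation); eps*I keeps the matrix invertible.
    vals, vecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    D = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return Xc @ D, D

rng = np.random.default_rng(0)
mixing = np.eye(4) + 0.8                          # hypothetical mixing that correlates features
X = rng.normal(size=(2000, 4)) @ mixing
Xw, D = deconvolve(X)

cov_before = np.cov(X, rowvar=False)
cov_after = Xw.T @ Xw / Xw.shape[0]               # close to the identity matrix
```

Before the transform the off-diagonal entries of `cov_before` are large; after it, `cov_after` matches the identity up to the small `eps` perturbation.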
In Fig. 2 (top right) we show the calculated covariance matrix of the data matrix in the first layer of a VGG network Simonyan and Zisserman [2014] from ImageNet. The first layer is a $3 \times 3$ convolution that mixes RGB channels. The total dimension of the weights is $27$ ($3 \times 3 \times 3$), and the corresponding covariance matrix is $27 \times 27$. The diagonal blocks correspond to the pixel-wise correlation within neighborhoods. The off-diagonal blocks correspond to correlation of pixels across different channels. Generally, natural images demonstrate stronger pixel-wise correlation than cross-channel correlation, as the diagonal blocks are significantly brighter than the off-diagonal blocks.
We denote the deconvolution operation as $D$; the input to the next layer can be written as:
$$x_{i+1} = f_i \circ W_i \circ D_i \circ x_i \qquad (1)$$
where $\circ$ is the (right) matrix multiplication operation, $x_i$ is the input coming from the $i$-th layer, $D_i$ is the deconvolution operation on that input, $W_i$ is the weights in the layer, and $f_i$ is the activation function. In general, the deconvolution operation removes the correlations between the columns.
For simplicity, we assume on all layers the activation function $f_i$ is a sample-variant matrix multiplication. The popular ReLU Nair and Hinton [2010] activation falls into this category. It is important to note that essentially this function multiplies the input vector by a diagonal matrix with entries either 1 or 0, based on the sign of the input sample. Because of this the average effect of $f_i$ on a batch of data is an attenuating effect, in that it can only either allow values to pass or turn them off (setting them to 0), and we assume that its operator norm is smaller than $1$. We investigate the transform of inputs to the next linear/convolution layer:
$$D_{i+1} \circ x_{i+1} = D_{i+1} \circ f_i \circ W_i \circ (D_i \circ x_i) \qquad (2)$$
where $D_i \circ x_i$ is the (deconvolved) input to the linear layer and $D_{i+1} \circ x_{i+1}$ is the (deconvolved) input to the next linear layer.
If we assume the output dimension and the input dimension to be the same, and if both $D_i \circ x_i$ and $D_{i+1} \circ x_{i+1}$ are i.i.d., then $D_{i+1} \circ f_i \circ W_i$ is an isometry on expectation, which keeps the statistical properties of the signal unchanged in the forward propagation. An ideal isometry also keeps properties of gradients unchanged in the backward propagation.
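The claim that a ReLU acts as a sample-dependent diagonal matrix of zeros and ones can be checked directly. The sketch below is illustrative only; the sample vector and the random batch are made-up inputs, and the batch-averaged mask is one assumed way of reading "average effect on a batch".

```python
import numpy as np

# One sample: ReLU equals multiplication by a 0/1 diagonal matrix
# whose entries depend on the sign of that sample.
x = np.array([1.5, -2.0, 0.3, -0.1])
mask = np.diag((x > 0).astype(float))
relu_via_matrix = mask @ x
relu_direct = np.maximum(x, 0.0)

# For a single sample the largest singular value of the mask is at most 1.
single_sample_norm = np.linalg.norm(mask, 2)

# Averaged over a batch, each diagonal entry becomes the fraction of
# samples that pass through, which is below 1 whenever some are zeroed.
rng = np.random.default_rng(0)
batch = rng.normal(size=(1000, 4))
avg_mask = (batch > 0).mean(axis=0)
```

For zero-mean inputs roughly half of the entries are switched off, so the averaged mask attenuates the signal, which is the effect the deconvolution at the next layer is meant to counteract.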
(3) |
(4) |
To emphasize this property on the backward propagation of the error signal: if the activation and the weights have a diminishing effect on the gradients, then the deconvolution operation is expected to raise the values to counteract it.
It is important to note that the procedure does not change the learning problem, as the standard network training is solving for the product of the weights and the deconvolution operator, $W_i \circ D_i$. Instead, this change of variables procedure removes the correlations between feature columns and improves the conditioning of the optimization problem.
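A small least-squares experiment illustrates the change-of-variables argument. It is a sketch under stated assumptions: the data, the targets `t`, and the exact eigendecomposition-based whitening operator `D` are all invented for the example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ (np.eye(4) + 0.8)   # correlated toy features
Xc = X - X.mean(axis=0)
t = rng.normal(size=300)                             # arbitrary regression targets

# Whitening operator D = Cov^(-1/2), computed exactly for illustration.
vals, vecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
D = vecs @ np.diag(vals ** -0.5) @ vecs.T
Xw = Xc @ D

# Fit once in the original variables and once in the whitened variables.
w_plain, *_ = np.linalg.lstsq(Xc, t, rcond=None)
w_white, *_ = np.linalg.lstsq(Xw, t, rcond=None)

same_predictions = np.allclose(Xc @ w_plain, Xw @ w_white)
same_weights = np.allclose(w_plain, D @ w_white)     # w_plain = D w_white
cond_before = np.linalg.cond(Xc.T @ Xc)
cond_after = np.linalg.cond(Xw.T @ Xw)               # close to 1
```

Both fits produce the same predictions, and the original weights are recovered as `D @ w_white`; only the condition number of the problem changes.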
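Another favorable property of deconvolution is that after applying it, optimal solutions may be computed in one iteration. Assume we are given a linear regression problem with $L_2$ loss:
$$y = X w \qquad (5)$$
where $y$ is the output, $X$ is the input, and $w$ is the weight matrix we are trying to solve for. Then, with $\hat{y}$ denoting the target, we have as an $L_2$/min loss function:
$$Loss_{L_2} = \frac{1}{2} \| y - \hat{y} \|^2 = \frac{1}{2} \| X w - \hat{y} \|^2 \qquad (6)$$
An explicit solution is given by setting $\frac{\partial Loss_{L_2}}{\partial w} = X^T (X w - \hat{y}) = 0$:
$$w = (X^T X)^{-1} X^T \hat{y} \qquad (7)$$
If the input is deconvolved, we have $X^T X = I$, then $w = X^T \hat{y}$. On the other hand, to conduct one iteration of gradient descent on Eq. 6, we have:
$$w_{new} = w_{old} - \alpha X^T (X w_{old} - \hat{y}) \qquad (8)$$
If the input is deconvolved then $\alpha = 1$ is optimal and with one iteration we will have $w_{new} = X^T \hat{y}$.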
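The one-iteration property can be verified numerically. The sketch below assumes a least-squares loss and a step size of 1, whitens the input so that its Gram matrix is the identity, and compares a single gradient step from an arbitrary start with the closed-form solution. All variable names and the toy data are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 500, 5
X = rng.normal(size=(N, F)) @ (np.eye(F) + 0.8)   # correlated toy inputs
targets = rng.normal(size=(N, 1))                 # arbitrary targets

# Whiten so that Xw^T Xw = I (Gram matrix, no 1/N factor).
Xc = X - X.mean(axis=0)
vals, vecs = np.linalg.eigh(Xc.T @ Xc)
Xw = Xc @ (vecs @ np.diag(vals ** -0.5) @ vecs.T)

# Closed-form least-squares solution (normal equations).
w_closed = np.linalg.solve(Xw.T @ Xw, Xw.T @ targets)

# One gradient-descent step with step size 1 from a random start.
w_old = rng.normal(size=(F, 1))
alpha = 1.0
w_new = w_old - alpha * Xw.T @ (Xw @ w_old - targets)
```

Because the Gram matrix is the identity, the `w_old` terms cancel and `w_new` lands on the closed-form solution regardless of the starting point.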
In our experiments section, we show that the deconvolution operation does in fact significantly speed up training on a variety of standard benchmark tasks.
In an effort to simplify and understand the calculation of a single kernel with the input data in a batch, we can rewrite the convolution operation in matrix form as $y = X w$. In essence, we are converting the entire process of convolution / shifting over the image into one large matrix multiplication. In the 2-dimensional case, $w$ is the flattened 2D kernel. The first column of $X$ corresponds to a flattened image patch of the input $x$. Neighboring columns correspond to shifted patches of $x$. This is done using the commonly used im2col function (see Fig. 2). When formulated in this manner, the resulting combined matrix is extremely ill-posed. This ill-posedness slows down the training algorithm, and cannot be addressed by normalization methods Ioffe and Szegedy [2015]. Solving for the kernel given the input data $X$ and the output data $y$ is known as a kernel estimation problem Ye et al. [2014]. It takes tens or hundreds of gradient descent iterations to converge to a practically close enough solution. However, we emphasize that a closed-form solution exists, as given by Eq. 7.
For convolutional networks, we usually have multiple input feature channels and multiple kernels in a layer. We vectorize and concatenate all the kernels to get $W$ and follow Algorithm 1 to construct $X$ and $D \approx (Cov + \epsilon \cdot I)^{-\frac{1}{2}}$. Here $\epsilon \cdot I$ is introduced to improve stability. We then apply the deconvolution operation $D$ to $X$ to remove the correlation between neighboring pixels and across different channels. The deconvolved data is then multiplied with $W$. The full equation becomes $y = X \cdot D \cdot W$ (Fig. 2). The output matrix $y$ is then reshaped into the output shape of the layer.
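The matrix form of convolution can be made concrete with a small im2col sketch. The helper below is a plain-Python illustration with stride 1 and no padding, not an optimized or framework implementation; the function name, the toy image size, and the row-per-patch layout are assumptions. As in most CNN frameworks, the operation computed is cross-correlation (no kernel flip).

```python
import numpy as np

def im2col(x, k):
    """Flatten every k x k patch of a (C, H, W) array into one row.

    Output shape: (num_patches, C * k * k); stride 1, no padding.
    Each column then holds one shifted copy of the image.
    """
    C, H, W = x.shape
    rows = []
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            rows.append(x[:, i:i + k, j:j + k].reshape(-1))
    return np.stack(rows)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8, 8))            # toy RGB image
w = rng.normal(size=(3, 3, 3))            # one 3x3 kernel over 3 channels

X = im2col(x, 3)                          # data matrix, one patch per row
y = (X @ w.reshape(-1)).reshape(6, 6)     # convolution as one matrix product

# Direct sliding-window computation for comparison.
y_direct = np.array([[np.sum(x[:, i:i + 3, j:j + 3] * w) for j in range(6)]
                     for i in range(6)])

# Covariance of the patch matrix: one row/column per kernel weight.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / X.shape[0]
```

For a 3x3 kernel over three channels the data matrix has 27 columns, so `cov` is the 27 x 27 matrix whose inverse square root the deconvolution step applies before the kernel weights.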
We compute the approximate inverse square root of the covariance matrix at low cost using the Denman-Beavers iteration method Denman and Beavers [1976] in a simple and straightforward fashion. This method is important because there is a first-order correction term (Eq. 4) that needs to be included to avoid accumulating errors when training deep networks. Given a symmetric positive definite covariance matrix $Cov$, Denman-Beavers iterations start with initial values $Y_0 = Cov$, $Z_0 = I$. The iteration is defined as: $Y_{k+1} = \frac{1}{2} Y_k (3I - Z_k Y_k)$, $Z_{k+1} = \frac{1}{2} (3I - Z_k Y_k) Z_k$, and $Z_k \rightarrow Cov^{-\frac{1}{2}}$ Lin and Maji [2017].
It is important to point out a practical implementation detail: when we have $Ch$ input feature channels, and the kernel size is $k \times k$, then the size of the covariance matrix is $(Ch \times k^2) \times (Ch \times k^2)$. The covariance matrix becomes large in deeper layers of the network. Inverting such a matrix is slow and highly unstable. In our implementation, we evenly divide the feature channels into smaller groups of $B$ channels each Ye et al. [2017], Wu and He [2018], Ye et al. [2019]. The mini-batch covariance of a small group has a manageable size of $(B \times k^2) \times (B \times k^2)$. Denman-Beavers iterations are therefore conducted on small matrices. We notice that only a few iterations are necessary to deconvolve both the pixel correlation and the (grouped) channel correlation, leading to fast convergence and better results. The overall complexity of computing the covariance matrix and solving for its inverse square root is smaller than that of the convolution operation in practice.
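An inverse-free coupled iteration for the inverse square root can be sketched as follows. This is an assumption-laden illustration rather than the paper's code: it uses the coupled Newton-Schulz form (a variant of the Denman-Beavers scheme that avoids explicit inverses), normalizes by the Frobenius norm so the iteration converges, and uses a hypothetical function name, iteration count, and `eps`.

```python
import numpy as np

def inverse_sqrt_coupled(cov, n_iter=15, eps=1e-5):
    """Approximate (cov + eps*I)^(-1/2) without any matrix inversion."""
    n = cov.shape[0]
    eye = np.eye(n)
    A = cov + eps * eye
    scale = np.linalg.norm(A)          # Frobenius norm; puts eigenvalues in (0, 1]
    Y = A / scale
    Z = eye.copy()
    for _ in range(n_iter):
        T = 0.5 * (3.0 * eye - Z @ Y)
        Y = Y @ T                      # Y converges to (A/scale)^(1/2)
        Z = T @ Z                      # Z converges to (A/scale)^(-1/2)
    return Z / np.sqrt(scale)          # undo the scaling

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4)) @ (np.eye(4) + 0.5)   # correlated toy data
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / Xc.shape[0]

D = inverse_sqrt_coupled(cov)
residual = np.abs(D @ (cov + 1e-5 * np.eye(4)) @ D - np.eye(4)).max()
```

On a small, well-conditioned matrix such as this one the residual is tiny after a handful of iterations; poorly conditioned matrices need more iterations, which is one motivation for working on small channel groups.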
Our deconvolution applied at each layer removes the pixel-wise and channel-wise correlation and transforms the original dense representations into sparse representations, without losing information. This is a desired property of the input data, and there is a whole field developed around sparse representations Olshausen and Field [1996], Hyvärinen et al. [2009]. In Fig. 3, we visualize the deconvolution operation on an input and show how the resulting representations (Fig. 3(d)) are much sparser than the original image (Fig. 3(b)). This also holds true for hidden layer representations. We show in the supplementary material that sparse representations make classic regularizations more effective.
We now describe experimental results validating that network deconvolution is a powerful and successful tool for sharpening the data. In fact, our experiments show that it outperforms identical networks using batch normalization Ioffe and Szegedy [2015], a widely used method for input normalization. As we will see across all experiments, deconvolution not only improves the final accuracy but also decreases the number of iterations needed to learn a reasonably good set of weights.
We note that in its current implementation, the runtime of our training using deconvolution is slower than convolution using wallclock as a metric; however, we believe this is due to a suboptimal implementation in PyTorch and can be addressed in future work by more optimized code in CUDA. Currently, we are also developing code in CUDA that will bring the deconvolution operation to the same level of parallelization as the PyTorch framework surrounding it.
We also plot and compare against previous work Huang et al. [2018], Ye et al. [2019] (see Related Work) that applied only channel-wise decorrelation, and show that our full network deconvolution outperforms channel-only deconvolution.
Linear Regression with $L_2$ loss and Logistic Regression: As a first experiment, we ran network deconvolution on a simple linear regression task to show its efficacy. We select the Fashion-MNIST dataset, which contains 60,000 article images for training and 10,000 for testing. The dataset has 10 categories. It is noteworthy that with binary targets and the $L_2$ loss, the problem has an explicit solution if we feed the whole dataset as input. This problem is the classic kernel estimation problem, where we need to solve for optimal kernels to convolve with the inputs and minimize the $L_2$ loss with the binary targets. During our experiment, we noticed that it is important to use a small learning rate for vanilla training to prevent divergence. With deconvolution, however, it is possible to use the optimal learning rate and get high accuracy as well. A low cost is reached within a small number of iterations under the mini-batch setting (Fig. 4(a)). This even holds if we change the loss to logistic loss (Fig. 4(b)).
CIFAR-10 | DC | BN | CD |
---|---|---|---|
VGG-11 | 91.33 | 89.15 | 90.23 |
DenseNet-121 | 94.71 | 93.45 | 93.65 |
ResNet-50 | 94.05 | 90.6 | 91.7 |
CIFAR-100 | DC | BN | CD |
---|---|---|---|
VGG-13 | 74.74 | 70.57 | 74.31 |
DenseNet-121 | 80.27 | 79.2 | 79.09 |
ResNet-50 | 80.43 | 77.78 | 76.62 |
Convolutional Networks on CIFAR-10/100: We ran deconvolution on the CIFAR-10 (Fig. 5, Table 1 (left)) and CIFAR-100 (Fig. 6, Table 1 (right)) datasets, where we again compare the use of network deconvolution versus the use of batch normalization and the use of network channel-only deconvolution. Across different network architectures for both datasets, deconvolution significantly improves the final accuracy on these well-known datasets. We find that deconvolution leads to faster convergence: on the CIFAR-10 dataset, 20 epochs of training with deconvolution lead to results that were only achievable using standard training for over 100 epochs.
As settings, we remove all batch normalization layers in the networks and replace them with deconvolution before each convolution/fully-connected layer. For convolutional layers, we split the feature channels into groups before calculating the covariance matrix. For fully-connected layers, we split the channels into groups. Here, we showcase the generalizability of deconvolution across a variety of different CNN architectures. We report results using some of the most popular architectures, ResNet He et al. [2015], VGGNet Simonyan and Zisserman [2014], and DenseNet Huang et al. [2016], where the standard batch normalization procedure has been replaced with pixel and channel deconvolution. For the CIFAR-10 experiments, we used a batch size of 128 and a weight decay of 0.001, to demonstrate the speed of convergence. For CIFAR-100, we used a batch size of 256 and weight decay of 005, with 100 epochs.
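Splitting features into groups before whitening can be sketched as below. This is a hedged illustration: the group size of 4, the exact eigendecomposition-based inverse square root, and the function name `grouped_whiten` are assumptions for the example, not the settings used in the experiments.

```python
import numpy as np

def grouped_whiten(X, group_size, eps=1e-5):
    """Whiten the columns of X independently within consecutive groups."""
    N, F = X.shape
    out = np.empty((N, F), dtype=float)
    for start in range(0, F, group_size):
        Xg = X[:, start:start + group_size]
        Xg = Xg - Xg.mean(axis=0)                      # center the group
        cov = Xg.T @ Xg / N + eps * np.eye(Xg.shape[1])
        vals, vecs = np.linalg.eigh(cov)
        D = (vecs * vals ** -0.5) @ vecs.T             # group-level Cov^(-1/2)
        out[:, start:start + group_size] = Xg @ D
    return out

rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 8)) @ (np.eye(8) + 0.5)        # 8 correlated toy features
Xg = grouped_whiten(X, group_size=4)

within = Xg[:, :4].T @ Xg[:, :4] / N                   # close to identity
across = Xg[:, :4].T @ Xg[:, 4:] / N                   # not removed by grouping
```

Each group only requires a small covariance matrix, which keeps the inverse square root cheap and stable; the trade-off is that correlation between different groups is left in place, as `across` shows.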
Convolutional Networks on ImageNet: We tested two widely used model architectures (VGG-11, ResNet-18) from the PyTorch model zoo and find significant improvements on both networks over the reference models in the model zoo. Notably, for the VGG-11 network, our method leads to significantly improved accuracy: the top-1 accuracy is even higher than that reported by the reference VGG-13 model trained with batch normalization. The improvement introduced by network deconvolution (71.74 vs. 69.02, +2.72 points) is twice as large as that from the introduction of batch normalization (70.38 vs. 69.02, +1.36 points). This also suggests that improving the training method may lead to larger gains than improving the architecture.
We keep most of the default settings to train the two models. For deconvolution, we use the settings described above with only one modification: for deconvolution of the fully-connected layers, we split the features into groups. Our conjecture is that for this complex dataset, the full feature covariance structure of the whole dataset is under-represented with a small batch size. However, dividing these features into groups alleviates this issue. The networks are trained for epochs with batch size , weight decay . The initial learning rates are , respectively for ResNet-18 and VGG-11 as described in the paper. We used cosine annealing to smoothly decrease the learning rate to compare the curves.
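Cosine annealing follows a half-cosine curve from the initial learning rate down to a minimum. The sketch below shows the standard formula; the initial rate of 0.1, the 100 steps, and the function name are placeholder values for illustration and are not the settings used in these experiments.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_init, lr_min=0.0):
    """Learning rate at `step` under a single half-cosine decay."""
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

# Hypothetical schedule: 100 steps starting from a learning rate of 0.1.
schedule = [cosine_annealing_lr(t, 100, 0.1) for t in range(101)]
```

The schedule starts at `lr_init`, passes through the midpoint value halfway, and reaches `lr_min` at the final step without any abrupt drops, which makes loss curves from different runs easier to compare.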
Non-Convolutional, Multi-layer Perceptron Networks: Finally, we ran experiments to confirm that the network deconvolution procedure can extend to non-convolutional layers via channel deconvolution, and is thus capable of improving classification on datasets not just important to computer vision but also to the broader machine learning community. We constructed a 3-layer fully-connected network that has 128 hidden nodes in each layer. For the activation function, we use the . As with the other experiments, we compare the use of batch normalization, channel-only deconvolution, and full deconvolution (pixel and channel). Indeed, applying deconvolution to MLP networks outperforms batch normalization, as shown in Fig. 4.
ImageNet | No Norm. | REF BN | CD | DC |
---|---|---|---|---|
VGG-11 | 69.02 | 70.38 | 71.45 | 71.74 |
ResNet-18 | N/A | 69.76 | 69.80 | 70.65 |
We would like to express our gratitude to Prof. Brian Hunt for his insightful comments.
In this paper we presented network deconvolution, a novel normalization method for pixel-wise decorrelation. The method was evaluated extensively and shown to improve the optimization efficiency over standard Batch Normalization. We provided a thorough analysis regarding its performance and demonstrated consistent performance improvements of the deconvolution operation on multiple major benchmarks. Our proposed deconvolution operation is straightforward in terms of implementation and can serve as a good alternative to Batch Normalization.
Source code can be found at: https://github.com/deconvolutionpaper/deconvolution
If two features correlate, weight decay regularization is less effective. Suppose $x_1, x_2$ are strongly correlated features that differ in scale, and consider $w_1 x_1 + w_2 x_2$. The weights $w_1, w_2$ are likely to co-adapt during training, and weight decay is likely to be more effective on the larger coefficient. The other, smaller coefficient is left less penalized. Because network deconvolution reduces the co-adaptation of weights, weight decay becomes less ambiguous and more effective (Fig. 8(a)).
Fig. 9 shows the inputs to the -th convolution layer in . This input is the output of a activation function. The deconvolution operation first subtracts the mean, and then removes the correlation between nearby pixels, resulting in a sharper and sparser representation.
We show the loss curves for different settings when training the VGG-11 network on the ImageNet dataset (Fig. 8(b)). Network deconvolution leads to significantly faster decay in training loss.