1 Introduction
Training convolutional neural networks becomes more difficult as the depth increases, and for very deep networks even the training accuracy decreases. Identity mappings (transformations), adopted as skip-connections in deep residual networks HeZRS16 , ease the training of very deep networks and improve accuracy. Identity transformations create shorter connections between layers close to the input and those close to the output. It has been shown that identity transformations improve information flow in both forward propagation and backpropagation because the product of identity matrices is still an identity matrix; in other words, a chain of such skip-connections essentially behaves like a single skip-connection, no matter how many skip-connections there are.
In this paper, we introduce two linear transformations and use them as skip-connections to improve information flow. The first is an orthogonal transformation. Multiplying several orthogonal matrices, used to form the orthogonal transformations, yields an orthogonal matrix. The benefit is that information attenuation and explosion are avoided, because the absolute values of the eigenvalues of an orthogonal matrix are always 1. The second is an idempotent transformation, whose transformation matrix is an idempotent matrix which, when multiplied by itself, yields itself. A sequence of idempotent transformations with the same idempotent matrix is equivalent to a single idempotent transformation. We show that the success essentially comes from feature reuse and gradient reuse in forward and backward propagation, which maintain the information and eliminate the gradient vanishing problem via the express way formed by the skip-connections.
Empirical results show that single-branch deep neural networks with idempotent and orthogonal transformations as skip-connections perform similarly to those with identity transformations, and that the performance is superior when applied to multi-branch networks.
2 Related Works
In general, deeper convolutional neural networks lead to superior classification accuracy. An example is the improvement on ImageNet classification from AlexNet KrizhevskySH12 (8 layers) to VGGNet SimonyanZ14a (19 layers). However, going deeper increases the training difficulty. Techniques to ease the training include optimization techniques nair2010rectified ; HeZRS15 ; clevert2015fast ; IoffeS15 ; glorot2010understanding ; mishkin2015all ; neyshabur2015path and network architecture design. In the following, we discuss representative works on network architecture design.
GoogLeNet SzegedyLJSRAEVR15 is one of the first works to design network architectures that deal with the difficulty of training deep networks. It is built by repeating Inception blocks, each of which contains short and long branches; thus there are both short and long paths between layers close to the input layer and those close to the output layer, i.e., information flow is improved.
Inspired by Long Short-Term Memory recurrent networks, highway networks srivastava2015training adopt identity transformations together with an adaptive gating mechanism, allowing computation paths along which information can flow across many layers without attenuation. This indeed eases the training of very deep networks, e.g., networks with hundreds of layers. Residual networks HeZRS16 also adopt identity transformations as skip-connections, but without gating units, making it easier to train networks of thousands of layers. In this paper, we introduce two alternative transformations, orthogonal and idempotent transformations, which also improve information flow. We do not find that they learn residuals as claimed in HeZRS16 , but rather find that features and gradients are reused through the express way composed of skip-connections.
FractalNets LarssonMS16a , deeply-fused nets WangWZZ16 , and DenseNets HuangLW16a present various multi-branch structures, leading to short and long paths between layers close to the input layer and those close to the output layer. Consequently, the effective depth LarssonMS16a or the average depth WangWZZ16 is greatly reduced even though the nominal depth is large, and accordingly information flow is improved.
Deep supervision LeeXGZT15 associates a companion local output, and accordingly a loss function, with each hidden layer, which results in shorter paths from hidden layers to the loss layers. Its success provides evidence that effective depth is crucial. FitNets RomeroBKCGB14 , a student-teacher paradigm, train a thinner and deeper student network such that its intermediate representations approach those of a wider and shallower (but still deep) teacher network that is relatively easy to train, which is in some sense a kind of deep supervision, also reducing the effective depth.

3 Orthogonal and Idempotent Transformations
A building block with a linear transformation used as the skip-connection is written as:

$$y = A x + \mathcal{F}(x; W). \qquad (1)$$

Here, $x$ and $y$ are the input and output features. $A x$ is the skip-connection, and $A$ is the transformation matrix. $\mathcal{F}(x; W)$ is the regular connection, e.g., two convolutional layers each followed by a ReLU activation function, and $W$ denotes the parameters of the function $\mathcal{F}$. Following residual networks HeZRS16 , we design the networks by starting with a convolutional layer, repeating such building blocks, and appending a global pooling layer and a fully-connected layer. Figure 1 illustrates such a block.
The recursive equations below show how the features are forward-propagated through the building blocks and how the gradients are backward-propagated.
Forward propagation. The transformation function, transferring the feature $x_n$, the input of the $n$th building block, to $x_{n+m}$, the input of the $(n+m)$th building block, is given as follows,

$$x_{n+m} = \Big(\prod_{i=n}^{n+m-1} A_i\Big) x_n + \sum_{i=n}^{n+m-1} \Big(\prod_{j=i+1}^{n+m-1} A_j\Big) \mathcal{F}(x_i; W_i), \qquad (2)$$

where the products are ordered as $\prod_{i=a}^{b} A_i = A_b A_{b-1} \cdots A_a$ (an empty product is $I$), and $x_i$ and $W_i$ are the input and the parameters of the $i$th block.
Backward propagation. The gradient is backward-propagated from $x_{n+m}$ to $x_n$ as below,

$$\frac{\partial \ell}{\partial x_n} = \Big(\prod_{i=n}^{n+m-1} A_i\Big)^{\top} \frac{\partial \ell}{\partial x_{n+m}} + \sum_{i=n}^{n+m-1} \Big(\frac{\partial \mathcal{F}(x_i; W_i)}{\partial x_n}\Big)^{\top} \Big(\prod_{j=i+1}^{n+m-1} A_j\Big)^{\top} \frac{\partial \ell}{\partial x_{n+m}}, \qquad (3)$$

where $\ell$ is the loss function.
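As a sanity check, the recursion above can be unrolled numerically. The NumPy sketch below (with a toy one-layer regular connection standing in for the convolutional $\mathcal{F}$, and a single shared skip matrix $A$, both simplifying assumptions for illustration) verifies that iterating the block equation agrees with the closed form of Equation 2:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3                        # feature dimension, number of blocks

A = rng.standard_normal((d, d))    # shared skip-connection matrix
Ws = [rng.standard_normal((d, d)) for _ in range(m)]

def F(x, W):
    # toy "regular connection": one linear layer followed by ReLU
    return np.maximum(W @ x, 0.0)

# iterate the block: x_{i+1} = A x_i + F(x_i; W_i)
x0 = rng.standard_normal(d)
x = x0.copy()
feats = [x0]
for W in Ws:
    x = A @ x + F(x, W)
    feats.append(x)

# closed form: x_m = A^m x_0 + sum_i A^{m-1-i} F(x_i; W_i)
closed = np.linalg.matrix_power(A, m) @ x0
for i, W in enumerate(Ws):
    closed += np.linalg.matrix_power(A, m - 1 - i) @ F(feats[i], W)

assert np.allclose(x, closed)
```

The same unrolling also shows the feature-reuse claim directly: every earlier feature $x_i$ appears in the closed form for $x_m$.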
In the following, we show that the feature $x_n$ is reused in any later feature $x_{n+m}$ instead of only in $x_{n+1}$, and that the gradient with respect to $x_{n+m}$ is reused in any earlier gradient with respect to $x_n$, such that the signal (information) is maintained and the vanishing problem is eliminated, for identity transformations, orthogonal transformations, and idempotent transformations.
3.1 Identity transformations
Identity transformations, i.e., $A = I$, are adopted in residual networks HeZRS16 . The forward and backward processes are rewritten as below,

$$x_{n+m} = x_n + \sum_{i=n}^{n+m-1} \mathcal{F}(x_i; W_i), \qquad (4)$$
$$\frac{\partial \ell}{\partial x_n} = \frac{\partial \ell}{\partial x_{n+m}} + \sum_{i=n}^{n+m-1} \Big(\frac{\partial \mathcal{F}(x_i; W_i)}{\partial x_n}\Big)^{\top} \frac{\partial \ell}{\partial x_{n+m}}. \qquad (5)$$

It is obvious that there is a path along the skip-connections, where (1) $x_n$ directly flows to $x_{n+m}$ though there are $m$ blocks between them; and (2) the gradient with respect to $x_{n+m}$ is directly sent back to the gradient with respect to $x_n$ along the same path (both correspond to the first term on the right-hand side of the above two equations).
3.2 Orthogonal transformations
An orthogonal transformation is a linear transformation whose transformation matrix is orthogonal. Mathematically, a matrix $O$ is orthogonal if $O^{\top} O = I$. We have the following property:
Property 1.
The product of an arbitrary number of orthogonal matrices is orthogonal: $O_m O_{m-1} \cdots O_1$ is orthogonal if $O_1, O_2, \dots, O_m$ are orthogonal.
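This property, and the norm preservation it implies for the skip path, is easy to verify numerically. The sketch below draws random orthogonal factors via QR decomposition (an illustrative choice, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 5

# random orthogonal matrices obtained from QR decompositions
Os = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(m)]

# the product of orthogonal matrices is again orthogonal ...
prod = np.eye(d)
for O in Os:
    prod = O @ prod
assert np.allclose(prod.T @ prod, np.eye(d))

# ... so the skip path neither attenuates nor amplifies the signal
v = rng.standard_normal(d)
assert np.isclose(np.linalg.norm(prod @ v), np.linalg.norm(v))
```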
Thus, the forward process (Equation 2) is rewritten as follows,

$$x_{n+m} = \bar{O} x_n + \sum_{i=n}^{n+m-1} \bar{O}_i\, \mathcal{F}(x_i; W_i), \qquad (6)$$

where $\bar{O} = O_{n+m-1} \cdots O_n$ and $\bar{O}_i = O_{n+m-1} \cdots O_{i+1}$ (with $\bar{O}_{n+m-1} = I$). We can see that $x_n$ is sent to $x_{n+m}$ via a single orthogonal transformation (corresponding to the first term on the right-hand side). One notable property is that any orthogonal transformation $O$ preserves the length of vectors: $\|O x\|_2 = \|x\|_2$. This means that through the path formed by the skip-connections, the norm of the vector is maintained.
The backward process (Equation 3) becomes

$$\frac{\partial \ell}{\partial x_n} = \bar{O}^{\top} \frac{\partial \ell}{\partial x_{n+m}} + \sum_{i=n}^{n+m-1} \Big(\frac{\partial \mathcal{F}(x_i; W_i)}{\partial x_n}\Big)^{\top} \bar{O}_i^{\top}\, \frac{\partial \ell}{\partial x_{n+m}}. \qquad (7)$$

Again, there is a path, formed by the skip-connections and behaving like a single orthogonal transformation layer, through which the gradient with respect to $x_{n+m}$ is sent to the gradient with respect to $x_n$ no matter how many building blocks lie between them, and the norm of the gradient is maintained.
Conversion to identity transformations. We show that the orthogonal transformation can be absorbed into the regular connection so that the skip-connection reduces to an identity transformation, as illustrated in Figure 1.
Theorem 1.
For a network formed with orthogonal transformations as skip-connections, there exists another network, formed with identity transformations as skip-connections, such that, given an arbitrary input $x_0$, the final outputs of the two networks are the same.
Proof.
We prove the theorem for networks in which the number of channels is not changed. For networks with the number of channels changed, the proof is a little more complex but similar.
The network with orthogonal transformations ($N$ building blocks) is mathematically formed as below,

$$x_{i+1} = O_i x_i + \mathcal{F}(x_i; W_i), \quad i = 0, 1, \dots, N-1, \qquad (8)$$
$$y = W_{fc}\, x_N, \qquad (9)$$

where $W_{fc}$ denotes the final (linear) layer. We construct a network with identity transformations,

$$\tilde{x}_{i+1} = \tilde{x}_i + \tilde{\mathcal{F}}_i(\tilde{x}_i; \tilde{W}_i), \quad i = 0, 1, \dots, N-1, \qquad (10)$$
$$\tilde{y} = \tilde{W}_{fc}\, \tilde{x}_N, \qquad (11)$$

satisfying

$$\tilde{\mathcal{F}}_i(z; \tilde{W}_i) = (O_i O_{i-1} \cdots O_0)^{\top}\, \mathcal{F}\big((O_{i-1} \cdots O_0)\, z; W_i\big), \qquad (12)$$
$$\tilde{W}_{fc} = W_{fc}\, (O_{N-1} \cdots O_0). \qquad (13)$$

It can be easily verified by induction that, given an arbitrary input $x_0 = \tilde{x}_0$, we have $\tilde{x}_i = (O_{i-1} \cdots O_0)^{\top} x_i$ for every $i$, and hence the final outputs of the two networks are the same: $\tilde{y} = \tilde{W}_{fc}\, \tilde{x}_N = W_{fc}\, x_N = y$.
Thus, the theorem holds. ∎
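The construction in the proof can be checked numerically. The sketch below (with a toy regular connection and with the final linear layer left implicit, both illustrative assumptions) builds the identity-skip network from an orthogonal-skip one and confirms that the two states agree up to the fixed rotation absorbed by the last layer:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 6, 4

Os = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(n)]
Ws = [rng.standard_normal((d, d)) for _ in range(n)]

def F(x, W):
    # toy regular connection standing in for the convolutional one
    return np.maximum(W @ x, 0.0)

# cumulative products P_i = O_{i-1} ... O_0, with P_0 = I
P = [np.eye(d)]
for O in Os:
    P.append(O @ P[-1])

x0 = rng.standard_normal(d)

# original network: orthogonal skip-connections
x = x0.copy()
for O, W in zip(Os, Ws):
    x = O @ x + F(x, W)

# constructed network: identity skip-connections with
# F~_i(z) = P_{i+1}^T F(P_i z; W_i)
z = x0.copy()
for i, W in enumerate(Ws):
    z = z + P[i + 1].T @ F(P[i] @ z, W)

# the two states agree up to the rotation P_n, which the
# final linear layer can absorb
assert np.allclose(P[n] @ z, x)
```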
3.3 Idempotent transformations
An idempotent transformation is a linear transformation in which the transformation matrix is an idempotent matrix. An idempotent matrix is a matrix which, when multiplied by itself, yields itself.
Definition 1 (Idempotent matrix).
The matrix $P$ is idempotent if and only if $P P = P$, or equivalently $P^{k} = P$, where $k$ is a positive integer.
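A classic concrete instance is an orthogonal projection matrix; the sketch below (a projection onto a random subspace, chosen for illustration) verifies the defining property:

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 6, 2

# orthogonal projection onto a random r-dimensional subspace:
# P = B (B^T B)^{-1} B^T is a classic idempotent matrix
B = rng.standard_normal((d, r))
P = B @ np.linalg.inv(B.T @ B) @ B.T

assert np.allclose(P @ P, P)                          # P P = P
assert np.allclose(np.linalg.matrix_power(P, 5), P)   # P^k = P, k >= 1
```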
The forward process (Equation 2) becomes

$$x_{n+m} = P x_n + \sum_{i=n}^{n+m-1} P_i\, \mathcal{F}(x_i; W_i), \qquad (14)$$

where $P_i = I$ when $i = n+m-1$, and $P_i = P$ otherwise. This implies that $x_n$ is directly sent to $x_{n+m}$ through the skip-connections, which behave like a single skip-connection.
Similarly, the backward process (Equation 3) becomes

$$\frac{\partial \ell}{\partial x_n} = P^{\top} \frac{\partial \ell}{\partial x_{n+m}} + \sum_{i=n}^{n+m-1} \Big(\frac{\partial \mathcal{F}(x_i; W_i)}{\partial x_n}\Big)^{\top} P_i^{\top}\, \frac{\partial \ell}{\partial x_{n+m}}, \qquad (15)$$

which again implies that the gradient with respect to $x_{n+m}$ is directly sent to the gradient with respect to $x_n$ through the skip-connections.
Information maintenance. Different from identity transformations and orthogonal transformations, idempotent transformations maintain the vector lying in the column space of $P$:

$$P z = z, \qquad (16)$$

where $z = P y$ and $y$ is an arbitrary $d$-dimensional vector.
Apparently, when a vector lies in the null space of $P$, i.e., $P x_n = 0$, it looks as if the skip-connections do not help to improve information flow. Considering another term on the right-hand side of Equation 14,

$$P\, \mathcal{F}(x_n; W_n), \qquad (17)$$

we can find $W_n$, if $\mathcal{F}$ is formed by convolutional layers and ReLU layers, such that

$$P\, \mathcal{F}(x_n; W_n) \neq 0, \qquad (18)$$

which means that there is a path along which the information does not vanish. Vanishing is still possible though rare, e.g., in the case that $\mathcal{F}(x_n; W_n)$ lies in the null space of $P$. A similar analysis holds for gradient maintenance.
Diagonalization. It is known that an idempotent matrix is diagonalizable: $P = B D B^{-1}$, where $D$ is a diagonal matrix whose diagonal entries are $0$ or $1$ and $B$ is invertible. We illustrate this in Figure 1. A network containing blocks formed with idempotent transformations, written as follows,

$$x_{i+1} = P x_i + \mathcal{F}(x_i; W_i), \qquad (19)$$
$$y = W_{fc}\, x_N, \qquad (20)$$

can be transferred to a network with skip-connections formed by linear transformations whose transformation matrix is a diagonal matrix composed of $0$s and $1$s:

$$\tilde{x}_{i+1} = D \tilde{x}_i + \tilde{\mathcal{F}}(\tilde{x}_i; \tilde{W}_i), \qquad (21)$$
$$\tilde{\mathcal{F}}(z; \tilde{W}_i) = B^{-1}\, \mathcal{F}(B z; W_i), \qquad (22)$$
$$\tilde{W}_{fc} = W_{fc}\, B, \qquad (23)$$

so that $\tilde{x}_i = B^{-1} x_i$ and the final outputs coincide.
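The decomposition can be illustrated numerically; the sketch below (toy sizes, randomly chosen $B$) builds an idempotent matrix from a $0/1$ diagonal and, conversely, recovers $0/1$ eigenvalues by diagonalizing it:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5

# build an idempotent matrix P = B diag(1,1,0,0,0) B^{-1}
B = rng.standard_normal((d, d))
D = np.diag([1.0, 1.0, 0.0, 0.0, 0.0])
P = B @ D @ np.linalg.inv(B)

assert np.allclose(P @ P, P)

# conversely, diagonalizing an idempotent matrix yields 0/1 eigenvalues
eigvals = np.linalg.eigvals(P)
assert np.allclose(np.sort(eigvals.real), [0, 0, 0, 1, 1], atol=1e-6)
assert np.allclose(eigvals.imag, 0, atol=1e-6)
```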
3.4 Discussions
We can easily show that identity transformations are both idempotent and orthogonal, as $I I = I$ (idempotent) and $I^{\top} I = I$ (orthogonal). Here we discuss a bit more on feature reuse, gradient backpropagation, and extensions.
Feature reuse. We have generalized identity transformations to orthogonal transformations and idempotent transformations, to eliminate information vanishing and explosion. One point we want to make clearer is that Equations 4, 6 and 14 (for forward propagation) hold for any $n$ and any $m \geq 1$. In other words, $x_{n+m}$ reuses all the previous features $x_n, x_{n+1}, \dots, x_{n+m-1}$, rather than only $x_{n+m-1}$. We have a similar observation on gradient reuse.
Gradient backpropagation. Considering a network without skip-connections, backpropagating the gradient through the regular connection yields the gradient with respect to $x_n$: $\frac{\partial \ell}{\partial x_n} = \big(\frac{\partial \mathcal{F}(x_n; W_n)}{\partial x_n}\big)^{\top} \frac{\partial \ell}{\partial x_{n+1}}$. With a linear transformation as the skip-connection, $\frac{\partial \ell}{\partial x_n} = \big(A + \frac{\partial \mathcal{F}(x_n; W_n)}{\partial x_n}\big)^{\top} \frac{\partial \ell}{\partial x_{n+1}}$. One reason for gradient vanishing ($\frac{\partial \ell}{\partial x_n} \approx 0$) is that $\frac{\partial \ell}{\partial x_{n+1}}$ lies in the null space of $\big(\frac{\partial \mathcal{F}(x_n; W_n)}{\partial x_n}\big)^{\top}$. Adding a proper $A$ to each $\frac{\partial \mathcal{F}(x_n; W_n)}{\partial x_n}$ in some sense shrinks the null space and reduces the chance of gradient vanishing. As expected, a transformation with higher rank leads to a lower chance of gradient vanishing. We empirically justify this in Figure 2.
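A minimal numerical illustration of this null-space argument (with a deliberately rank-deficient toy Jacobian, an illustrative assumption, not a trained network's Jacobian):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4

# rank-deficient Jacobian of the regular connection: J^T kills part of R^d
J = np.zeros((d, d))
J[:2, :2] = rng.standard_normal((2, 2))      # rows 2..3 are zero

g = np.zeros(d)
g[3] = 1.0                                   # upstream gradient in null(J^T)

assert np.allclose(J.T @ g, 0)               # no skip: the gradient vanishes

A = np.eye(d)                                # full-rank skip-connection
assert np.linalg.norm((A + J).T @ g) > 0.5   # with the skip: it survives
```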
Extension of idempotent transformations. We extend idempotent transformations $P = B D B^{-1}$ by relaxing the diagonal entries (eigenvalues) in $D$. The relaxed conditions are (i) the absolute values of the diagonal entries are not larger than $1$, and (ii) there is at least one diagonal entry whose absolute value is $1$. Considering the special case where the absolute values of the eigenvalues are only $0$ or $1$, the magnitudes of the eigen-components of the vector lying in the column space of $P$ are maintained. A typical example is a periodic matrix: $P^{k+1} = P$, where $k$ is a positive integer.
3.5 Multi-branch networks
Multi-branch networks have been studied in many recent works AbdiN16 ; TargAL16 ; XieGDTH16 ; ZhaoWLTZ16 . We study the application of the two linear transformations to multi-branch structures.
Orthogonal transformations. Theorem 1 shows how orthogonal transformations are transformed to identity transformations, and it still holds for multi-branch structures. Figure 1 depicts the regular connection in the block converted from a block with an orthogonal transformation for multiple branches. We can see that the converted regular connection cannot be separated into multiple branches (as shown in Figure 1) because of two extra transformations: a pre-transformation (shortened as pre in Figure 1) and a post-transformation (shortened as post in Figure 1). The two transformations are essentially convolutions, which exchange information across the branches. Without the two transformations (e.g., in residual networks using identity transformations), there is no interaction across these branches.
Idempotent transformations. We have shown that idempotent transformations can be transformed to diagonalized idempotent transformations. There are again two extra transformations in the regular connections, a pre-transformation and a post-transformation (see Figure 1), while the diagonal entries of the diagonal idempotent matrix are $0$s and $1$s. Compared to identity transformations, the regular connection contains two extra convolutions, which is similar to orthogonal transformations and results in information exchange across the branches.
4 Experiments
4.1 Datasets
CIFAR. CIFAR-10 and CIFAR-100 Alex2009 are subsets of the 80 million tiny images database TorralbaFF08 . Both datasets contain 60,000 32x32 color images, with 50,000 training images and 10,000 testing images. The CIFAR-10 dataset includes 10 classes, each containing 6,000 images, 5,000 for training and 1,000 for testing. The CIFAR-100 dataset includes 100 classes, each containing 600 images, 500 for training and 100 for testing. We follow a standard data augmentation scheme widely used for these datasets HeZRS16 ; HuangLW16a ; HuangSLSW16 ; LeeXGZT15 ; LinCY13 ; XieGDTH16 : we first zero-pad the images with 4 pixels on each side, then randomly crop them to produce 32x32 images, followed by horizontally mirroring half of the images. We normalize the images using the channel means and standard deviations.
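For reference, this augmentation can be sketched in a few lines of NumPy (this is the standard pad-crop-flip recipe, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(6)

def augment(img, pad=4):
    """Standard CIFAR augmentation: zero-pad, random crop, random flip."""
    h, w, c = img.shape
    padded = np.zeros((h + 2 * pad, w + 2 * pad, c), dtype=img.dtype)
    padded[pad:pad + h, pad:pad + w] = img
    top = rng.integers(0, 2 * pad + 1)       # random crop offsets
    left = rng.integers(0, 2 * pad + 1)
    out = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:                   # mirror half of the images
        out = out[:, ::-1]
    return out

img = rng.random((32, 32, 3)).astype(np.float32)
aug = augment(img)
assert aug.shape == (32, 32, 3)
```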
SVHN. The Street View House Numbers (SVHN) dataset is obtained from house numbers in Google Street View images. It contains 73,257 training images, 26,032 testing images, and 531,131 additional training images. Following HuangSLSW16 ; LeeXGZT15 ; LinCY13 , we select 400 samples per class from the training set and 200 samples per class from the additional set, and use the remaining images as the training set without any data augmentation.
4.2 Setup
Networks. The network starts with a convolutional layer, followed by 3 stages, where each stage contains the same number of building blocks and two down-sampling layers lie between the stages, and ends with a global pooling layer followed by a fully-connected layer that outputs the classification result.
In our experiments, we consider 3 kinds of regular connections forming the building blocks: (a) single branch, (b) four branches, and (c) depth-wise convolution (an extreme multi-branch connection, where each branch contains one channel). Each branch consists of batch normalization, convolution, batch normalization, ReLU and convolution (BN-Conv-BN-ReLU-Conv), which is similar to the pre-activation residual connection
HeZRS16ECCV . We empirically study two idempotent transformations and two orthogonal transformations, and compare them with identity transformations.
Idempotent transformations: The first one is of the merge-and-run style ZhaoWLTZ16 ; ZhaoWLTZ17 , denoted by IdempotentMR:

$$A_{mr} = \frac{1}{b}\begin{bmatrix} I & \cdots & I \\ \vdots & \ddots & \vdots \\ I & \cdots & I \end{bmatrix}, \qquad (24)$$

which is a block matrix containing $b \times b$ blocks, with $b$ being the number of branches, and each block being an identity matrix (scaled by $\frac{1}{b}$). The second one is obtained by subtracting it from the identity matrix:

$$A_{cmr} = I - A_{mr}, \qquad (25)$$

which is named IdempotentCMR (c = complement). The ranks of the two matrices are $\frac{n}{b}$ and $n - \frac{n}{b}$, where $n$ is the size of the matrix, i.e., the total number of channels.
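The two matrices are easy to construct and check; the sketch below (toy sizes, chosen for illustration) verifies idempotency and the stated ranks:

```python
import numpy as np

d, b = 4, 3                 # channels per branch, number of branches
n = d * b                   # total number of channels

# IdempotentMR: a b x b grid of blocks, each block being (1/b) * I_d
MR = np.tile(np.eye(d), (b, b)) / b
# IdempotentCMR: its complement with respect to the identity
CMR = np.eye(n) - MR

assert np.allclose(MR @ MR, MR)
assert np.allclose(CMR @ CMR, CMR)
assert np.linalg.matrix_rank(MR) == d            # rank n/b
assert np.linalg.matrix_rank(CMR) == n - d       # rank n - n/b
```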
Orthogonal transformations: The first one is built from the Kronecker product: $O = Q \otimes Q \otimes \cdots \otimes Q$, where $\otimes$ is the Kronecker product operation and

$$Q = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \qquad (26)$$

We name it OrthogonalTP. The second one is a random orthogonal transformation, named OrthogonalRandom, also constructed using the Kronecker product. In each block, we generate a different orthogonal transformation.
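The construction can be sketched as follows; note that the 2x2 seed matrix Q used here (a normalized Hadamard-type matrix) is an assumption for illustration, since any orthogonal seed yields an orthogonal Kronecker power:

```python
import numpy as np
from functools import reduce

# 2x2 orthogonal seed (normalized Hadamard-type matrix; an assumption)
Q = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

def kron_orthogonal(k):
    """Kronecker power Q (x) Q (x) ... (x) Q: a 2^k x 2^k orthogonal matrix."""
    return reduce(np.kron, [Q] * k)

O = kron_orthogonal(4)          # 16 x 16
assert np.allclose(O.T @ O, np.eye(16))
```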
Training.
We use the SGD algorithm with Nesterov momentum to train all the networks on CIFAR-10/CIFAR-100 and SVHN with the same total mini-batch size. The initial learning rate is set to 0.1 and is divided by 10 at 50% and 75% of the total number of training epochs. Following residual networks HeZRS15 , the weight decay is 0.0001, the momentum is 0.9, and the weights are initialized as in residual networks HeZRS15 . Our implementation is based on Keras and TensorFlow AbadiABBCCCDDDG16 .

4.3 Results
Single-branch. We compare four skip-connections: identity transformation, IdempotentCMR, OrthogonalTP and OrthogonalRandom. To form the idempotent matrix for IdempotentCMR, we set $b$ to be the number of channels, i.e., $A_{mr}$ is a matrix with all entries being $\frac{1}{n}$. We do not evaluate IdempotentMR because in this case its rank is only $1$, and its performance is expected to be low. In general, idempotent transformations with lower ranks perform worse than those with higher ranks; this is empirically verified in Figure 2.
Table 1 shows the results over networks of depth 20 and 56, containing 9 and 27 building blocks, respectively. One can see that the results with idempotent transformations and orthogonal transformations are similar to those with identity transformations, empirically demonstrating that idempotent and orthogonal transformations improve information flow.
Table 1.

Depth  Mapping           Width     Accuracy (CIFAR-10 / CIFAR-100 / SVHN)
20     Identity          16,32,64
20     IdempotentCMR     16,32,64
20     OrthogonalTP      16,32,64
20     OrthogonalRandom  16,32,64
56     Identity          16,32,64
56     IdempotentCMR     16,32,64
56     OrthogonalTP      16,32,64
56     OrthogonalRandom  16,32,64
Table 2.

Depth  Mapping           Width      Accuracy (CIFAR-10 / CIFAR-100 / SVHN)
20     Identity          32,64,128
20     IdempotentCMR     32,64,128
20     IdempotentMR      32,64,128
20     OrthogonalTP      32,64,128
20     OrthogonalRandom  32,64,128
32     Identity          32,64,128
32     IdempotentCMR     32,64,128
32     IdempotentMR      32,64,128
32     OrthogonalTP      32,64,128
32     OrthogonalRandom  32,64,128
56     Identity          32,64,128
56     IdempotentCMR     32,64,128
56     IdempotentMR      32,64,128
56     OrthogonalTP      32,64,128
56     OrthogonalRandom  32,64,128
110    Identity          32,64,128
110    IdempotentCMR     32,64,128
110    IdempotentMR      32,64,128
110    OrthogonalTP      32,64,128
110    OrthogonalRandom  32,64,128
Four-branch. We compare the results over networks where each regular connection consists of four branches. The results are shown in Table 2. One can see that the idempotent and orthogonal transformations perform better than identity transformations. The reason is that, compared to identity transformations, the designed idempotent and orthogonal transformations introduce interactions across the four branches.
Depth-wise. We evaluate the performance over extreme multi-branch networks: depth-wise networks, where each branch contains only a single channel. Table 3 shows the results. One can see that the comparison is consistent with the four-branch case.
Table 3.

Depth  Mapping           Width      Accuracy (CIFAR-10 / CIFAR-100 / SVHN)
20     Identity          64,128,256
20     IdempotentCMR     64,128,256
20     OrthogonalTP      64,128,256
20     OrthogonalRandom  64,128,256
56     Identity          64,128,256
56     IdempotentCMR     64,128,256
56     OrthogonalTP      64,128,256
56     OrthogonalRandom  64,128,256
5 Conclusions
We introduce two linear transformations, orthogonal and idempotent transformations, which, as we show theoretically and empirically, behave like identity transformations, improving information flow and easing the training. One interesting point is that the success stems from feature and gradient reuse through the express way composed of skip-connections, which maintains the information during flow and eliminates the gradient vanishing problem.
References
 [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Largescale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.
 [2] Masoud Abdi and Saeid Nahavandi. Multi-residual networks. CoRR, abs/1609.05672, 2016.
 [3] Anonymous. Deep convolutional neural networks with merge-and-run mappings. Under review.
 [4] DjorkArné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
 [5] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Aistats, volume 9, pages 249–256, 2010.
 [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification. In ICCV, pages 1026–1034. IEEE Computer Society, 2015.
 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.

 [8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
 [9] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
 [10] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
 [11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [12] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 1106–1114, 2012.
 [14] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. CoRR, abs/1605.07648, 2016.
 [15] ChenYu Lee, Saining Xie, Patrick W. Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeplysupervised nets. In Guy Lebanon and S. V. N. Vishwanathan, editors, AISTATS, volume 38 of JMLR Workshop and Conference Proceedings. JMLR.org, 2015.
 [16] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.
 [17] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.

 [18] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
 [19] Behnam Neyshabur, Ruslan R. Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pages 2422–2430, 2015.
 [20] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.
 [21] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 [22] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.

 [23] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [24] Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in resnet: Generalizing residual architectures. CoRR, abs/1603.08029, 2016.

 [25] Antonio Torralba, Robert Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 30(11):1958–1970, 2008.
 [26] Jingdong Wang, Zhen Wei, Ting Zhang, and Wenjun Zeng. Deeply-fused nets. CoRR, abs/1605.07716, 2016.
 [27] Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. CoRR, abs/1611.05431, 2016.
 [28] Liming Zhao, Jingdong Wang, Xi Li, Zhuowen Tu, and Wenjun Zeng. On the connection of deep fusion to ensembling. CoRR, abs/1611.07718, 2016.