Orthogonal and Idempotent Transformations for Learning Deep Neural Networks

07/19/2017 ∙ by Jingdong Wang, et al.

Identity transformations, used as skip-connections in residual networks, directly connect convolutional layers close to the input and those close to the output in deep neural networks, improving information flow and thus easing the training. In this paper, we introduce two alternative linear transformations, orthogonal transformations and idempotent transformations. By the definition and properties of orthogonal and idempotent matrices, the product of multiple orthogonal matrices (or of copies of the same idempotent matrix), used to form linear transformations, is equal to a single orthogonal (idempotent) matrix, so that information flow is improved and the training is eased. One interesting point is that the success essentially stems from feature reuse and gradient reuse in forward and backward propagation, which maintain the information during the flow and eliminate the gradient vanishing problem, thanks to the expressway formed by skip-connections. We empirically demonstrate the effectiveness of the two proposed transformations: they perform similarly to identity transformations in single-branch networks and even better in multi-branch networks.


1 Introduction

Training convolutional neural networks becomes more difficult as the depth increases, and even the training accuracy decreases for very deep networks. Identity mappings or transformations, which are adopted as skip-connections in deep residual networks HeZRS16 , ease the training of very deep networks and improve the accuracy.

Identity transformations lead to shorter connections between layers close to the input and those close to the output. It has been shown that identity transformations improve information flow in both forward propagation and back-propagation because the product of identity matrices is still an identity matrix; in other words, multiple skip-connections are essentially like a single skip-connection, no matter how many skip-connections there are.

In this paper, we introduce two linear transformations and use them as skip-connections for improving information flow. The first one is an orthogonal transformation. Multiplying several orthogonal matrices, used to form the orthogonal transformations, yields an orthogonal matrix. The benefit is that information attenuation and explosion are avoided because the absolute values of the eigenvalues of an orthogonal matrix are always 1. The second one is an idempotent transformation, whose transformation matrix is an idempotent matrix, i.e., a matrix which, when multiplied by itself, yields itself. A sequence of idempotent transformations with the same idempotent matrix is equivalent to a single idempotent transformation. We show that the success essentially comes from feature reuse and gradient reuse in forward and backward propagation, which maintain the information and eliminate the gradient vanishing problem, thanks to the expressway formed by skip-connections.

The empirical results show that single-branch deep neural networks with idempotent and orthogonal transformations as skip-connections perform similarly to those with identity transformations, and that the performances are superior when applied to multi-branch networks.

2 Related Works

In general, deeper convolutional neural networks lead to superior classification accuracy. An example is the improvement on ImageNet classification from AlexNet KrizhevskySH12 (8 layers) to VGGNet SimonyanZ14a (19 layers). However, going deeper increases the training difficulty. Techniques to ease the training include optimization techniques nair2010rectified ; HeZRS15 ; clevert2015fast ; IoffeS15 ; glorot2010understanding ; mishkin2015all ; neyshabur2015path and network architecture design. In the following, we discuss representative works on network architecture design.

GoogLeNet SzegedyLJSRAEVR15 is one of the first works designing network architectures to deal with the difficulty of training deep networks. It is built by repeating Inception blocks, each of which contains short and long branches; thus there are both short and long paths between layers close to the input layer and those close to the output layer, i.e., information flow is improved.

Inspired by Long Short-Term Memory recurrent networks, highway networks srivastava2015training adopt identity transformations together with an adaptive gating mechanism, allowing computation paths along which information can flow across many layers without attenuation, which indeed eases the training of very deep networks. Residual networks HeZRS16 also adopt identity transformations as skip-connections, but without gating units, making it easier to train networks of thousands of layers. In this paper, we introduce two alternative transformations, orthogonal and idempotent transformations, which also improve information flow. We do not find that they learn residuals as claimed in HeZRS16 , but find that features and gradients are reused through the expressway composed of skip-connections.

FractalNets LarssonMS16a , deeply-fused nets WangWZZ16 , and DenseNets HuangLW16a present various multi-branch structures, leading to short and long paths between layers close to the input layer and those close to the output layer. Consequently, the effective depth LarssonMS16a or the average depth WangWZZ16 is greatly reduced even though the nominal depth is large, and accordingly information flow is improved.

Deep supervision LeeXGZT15 associates a companion local output, and accordingly a loss function, with each hidden layer, which results in shorter paths from hidden layers to the loss layers. Its success provides evidence that effective depth is crucial. FitNets RomeroBKCGB14 , a student-teacher paradigm, train a thinner and deeper student network such that its intermediate representations approach the intermediate representations of a wider and shallower (but still deep) teacher network that is relatively easy to train. This is in some sense a kind of deep supervision, also reducing the effective depth.

Figure 1: (a) A building block with a linear transformation $P$. (b) An equivalent form for an orthogonal transformation; the pre-orthogonal and post-orthogonal transformations are formed from accumulated products of the orthogonal matrices. (c) An equivalent form for an idempotent transformation. (d) The regular connection with two branches converted from an orthogonal or idempotent transformation, with a pre-linear transformation and a post-linear transformation. (e) Two separate branches, each with its own pre-linear and post-linear transformation. The network in (d) cannot be converted to that in (e) for general pre- and post-transformations.

3 Orthogonal and Idempotent Transformations

A building block with a linear transformation used as the skip-connection is written as:

$\mathbf{x}_{l+1} = P\,\mathbf{x}_l + \mathcal{F}(\mathbf{x}_l; \mathcal{W}_l).$ (1)

Here, $\mathbf{x}_l$ and $\mathbf{x}_{l+1}$ are the input and output features of the $l$th block, $P\,\mathbf{x}_l$ is the skip-connection, and $P$ is the transformation matrix. $\mathcal{F}(\mathbf{x}_l; \mathcal{W}_l)$ is the regular connection, e.g., two convolutional layers each followed by a ReLU activation function, and $\mathcal{W}_l$ denotes the parameters of the function $\mathcal{F}$. Following residual networks HeZRS16 , we design the networks by starting with a convolutional layer, repeating such building blocks, and appending a global pooling layer and a fully-connected layer. Figure 1 illustrates such a block.
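As a concrete sketch of such a building block (hypothetical shapes, a dense matrix with ReLU standing in for the convolutional regular connection, and an identity skip-connection chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def regular_connection(x, W):
    """Stand-in for F(x; W): one linear map followed by ReLU.
    The real regular connection uses two convolutions; this is a sketch."""
    return np.maximum(W @ x, 0.0)

def block(x, P, W):
    """One building block: skip-connection P @ x plus the regular connection."""
    return P @ x + regular_connection(x, W)

n = 4
x = rng.standard_normal(n)
P = np.eye(n)                      # identity skip-connection (residual block)
W = rng.standard_normal((n, n))
y = block(x, P, W)
assert y.shape == (n,)
```

With the regular connection zeroed out, the block reduces to the pure skip-connection, which is exactly the path the paper analyzes.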

The recursive equations below show how the features are forward-propagated through building blocks and the gradients are backward-propagated.

Forward propagation. The transformation function, transferring the feature $\mathbf{x}_l$, the input of the $l$th building block, to $\mathbf{x}_L$, the input of the $L$th building block, is given as follows,

$\mathbf{x}_L = P^{L-l}\,\mathbf{x}_l + \sum_{i=l}^{L-1} P^{L-1-i}\,\mathcal{F}(\mathbf{x}_i; \mathcal{W}_i),$ (2)

where $P^0 = I$, and $\mathbf{x}_i$ and $\mathcal{W}_i$ are the input and the parameters of the $i$th block.

Backward propagation. The gradient is backward-propagated from $\mathbf{x}_L$ to $\mathbf{x}_l$ as below,

$\frac{\partial \ell}{\partial \mathbf{x}_l} = (P^\top)^{L-l}\,\frac{\partial \ell}{\partial \mathbf{x}_L} + \sum_{i=l}^{L-1} (P^\top)^{i-l} \left(\frac{\partial \mathcal{F}(\mathbf{x}_i; \mathcal{W}_i)}{\partial \mathbf{x}_i}\right)^{\!\top} \frac{\partial \ell}{\partial \mathbf{x}_{i+1}},$ (3)

where $\ell$ is the loss function.

In the following, we show that the feature $\mathbf{x}_l$ is reused in any later feature $\mathbf{x}_L$ instead of only $\mathbf{x}_{l+1}$, and the gradient with respect to $\mathbf{x}_L$ is reused in the gradient with respect to any earlier $\mathbf{x}_l$, such that the signal (information) is maintained and the vanishing problem is eliminated, for identity transformations, orthogonal transformations and idempotent transformations.

3.1 Identity transformations

Identity transformations, i.e., $P = I$, are adopted in residual networks HeZRS16 . The forward and backward processes are rewritten as below,

$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1} \mathcal{F}(\mathbf{x}_i; \mathcal{W}_i),$ (4)

$\frac{\partial \ell}{\partial \mathbf{x}_l} = \frac{\partial \ell}{\partial \mathbf{x}_L} + \sum_{i=l}^{L-1} \left(\frac{\partial \mathcal{F}(\mathbf{x}_i; \mathcal{W}_i)}{\partial \mathbf{x}_i}\right)^{\!\top} \frac{\partial \ell}{\partial \mathbf{x}_{i+1}}.$ (5)

It is obvious that there is a path along skip-connections, where (1) $\mathbf{x}_l$ directly flows to $\mathbf{x}_L$ though there are $L - l$ blocks in between; and (2) the gradient with respect to $\mathbf{x}_L$ is directly sent backward to the gradient with respect to $\mathbf{x}_l$ along the same path (both correspond to the first term on the right-hand side of the above two equations).

3.2 Orthogonal transformations

An orthogonal transformation is a linear transformation whose transformation matrix is orthogonal. Mathematically, a matrix $O$ is orthogonal if $O^\top O = I$. We have the following property:

Property 1.

The product of an arbitrary number of orthogonal matrices is orthogonal: $O_1 O_2 \cdots O_K$ is orthogonal if $O_1, O_2, \dots, O_K$ are orthogonal.

Thus, the forward process (Equation 2) is rewritten as follows,

$\mathbf{x}_L = \bar{O}\,\mathbf{x}_l + \sum_{i=l}^{L-1} \bar{O}_i\,\mathcal{F}(\mathbf{x}_i; \mathcal{W}_i),$ (6)

where $\bar{O} = O^{L-l}$ and $\bar{O}_i = O^{L-1-i}$, both of which are orthogonal. We can see that $\mathbf{x}_l$ is sent to $\mathbf{x}_L$ via a single orthogonal transformation (corresponding to the first term on the right-hand side). One notable property is that any orthogonal transformation preserves the length of vectors: $\|O\mathbf{x}\|_2 = \|\mathbf{x}\|_2$. This means that through the path formed by skip-connections, the norm of the vector is maintained.
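Both facts, closure of orthogonal matrices under products and norm preservation, are easy to check numerically; a minimal sketch, using QR decompositions of random Gaussian matrices to generate orthogonal matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

def random_orthogonal(n):
    # QR decomposition of a random Gaussian matrix yields an orthogonal Q.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

# The product of several orthogonal matrices is again orthogonal ...
O = np.eye(n)
for _ in range(10):
    O = random_orthogonal(n) @ O
assert np.allclose(O.T @ O, np.eye(n))

# ... and orthogonal transformations preserve vector norms.
x = rng.standard_normal(n)
assert np.isclose(np.linalg.norm(O @ x), np.linalg.norm(x))
```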

The backward process (Equation 3) becomes

$\frac{\partial \ell}{\partial \mathbf{x}_l} = \bar{O}^\top\,\frac{\partial \ell}{\partial \mathbf{x}_L} + \sum_{i=l}^{L-1} (O^\top)^{i-l} \left(\frac{\partial \mathcal{F}(\mathbf{x}_i; \mathcal{W}_i)}{\partial \mathbf{x}_i}\right)^{\!\top} \frac{\partial \ell}{\partial \mathbf{x}_{i+1}}.$ (7)

Again, there is a path, formed by skip-connections and behaving like a single orthogonal transformation layer, where the gradient with respect to $\mathbf{x}_L$ is sent to the gradient with respect to $\mathbf{x}_l$ no matter how many building blocks there are between them, and the norm of the gradient is maintained.

Conversion to identity transformations. We show that the orthogonal transformation can be absorbed into the regular connection and the skip-connection is reduced to an identity transformation, which is illustrated in Figure 1.

Theorem 1.

For a network formed with orthogonal transformations as skip-connections, there exists another network, formed with identity transformations as skip-connections, such that, given an arbitrary input, the final outputs of the two networks are the same.

Proof.

We prove the theorem by considering the networks in which the number of channels is not changed. For networks in which the number of channels changes, the proof is similar but slightly more complex.

The network with orthogonal transformations ($L$ building blocks) is mathematically formed as below,

$\mathbf{x}_{l+1} = O_l\,\mathbf{x}_l + \mathcal{F}(\mathbf{x}_l; \mathcal{W}_l), \quad l = 1, \dots, L,$ (8)

$\mathbf{y} = \mathbf{x}_{L+1}.$ (9)

We construct a network with identity transformations,

$\mathbf{z}_{l+1} = \mathbf{z}_l + \mathcal{G}(\mathbf{z}_l; \mathcal{V}_l), \quad l = 1, \dots, L,$ (10)

$\mathbf{y}' = Q_{L+1}\,\mathbf{z}_{L+1},$ (11)

satisfying

$Q_{l+1} = O_l\,Q_l, \quad Q_1 = I,$ (12)

$\mathcal{G}(\mathbf{z}; \mathcal{V}_l) = Q_{l+1}^\top\,\mathcal{F}(Q_l\,\mathbf{z}; \mathcal{W}_l).$ (13)

It can be easily verified by induction that, given an arbitrary input $\mathbf{z}_1 = \mathbf{x}_1$, we have $\mathbf{z}_l = Q_l^\top\,\mathbf{x}_l$ for every $l$, and hence the final outputs of the two networks are the same: $\mathbf{y}' = Q_{L+1}\,\mathbf{z}_{L+1} = \mathbf{x}_{L+1} = \mathbf{y}$.

Thus, the theorem holds. ∎
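The construction can be checked numerically. The sketch below uses a dense linear-plus-ReLU stand-in for the regular connection (an assumption for illustration; the argument does not depend on the form of the regular connection) and compares the orthogonal-skip network against the identity-skip network with pre-/post-transformed regular connections:

```python
import numpy as np

rng = np.random.default_rng(2)
n, L = 4, 3

def F(x, W):
    # Stand-in regular connection: linear map + ReLU (a sketch).
    return np.maximum(W @ x, 0.0)

def random_orthogonal(n):
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

Os = [random_orthogonal(n) for _ in range(L)]
Ws = [rng.standard_normal((n, n)) for _ in range(L)]

# Network 1: orthogonal skip-connections.
x = x0 = rng.standard_normal(n)
for O, W in zip(Os, Ws):
    x = O @ x + F(x, W)

# Network 2: identity skip-connections with pre-/post-transformed
# regular connections G_l(z) = Q_{l+1}^T F(Q_l z), where Q_{l+1} = O_l Q_l.
Q = np.eye(n)
z = x0
for O, W in zip(Os, Ws):
    Q_next = O @ Q
    z = z + Q_next.T @ F(Q @ z, W)
    Q = Q_next

# The outputs agree up to the final accumulated rotation Q_{L+1}.
assert np.allclose(Q @ z, x)
```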

3.3 Idempotent transformations

An idempotent transformation is a linear transformation in which the transformation matrix is an idempotent matrix. An idempotent matrix is a matrix which, when multiplied by itself, yields itself.

Definition 1 (Idempotent matrix).

The matrix $P$ is idempotent if and only if $P^2 = P$, or equivalently $P^t = P$, where $t$ is a positive integer.
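A quick numerical illustration of the definition (a sketch; the projection matrix used here is a generic example of an idempotent matrix, not one of the transformations used later in the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4

# Any projection onto a subspace is idempotent. Here: orthogonal
# projection onto the column space of a random n x 2 matrix A.
A = rng.standard_normal((n, 2))
P = A @ np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(P @ P, P)                          # P^2 = P
assert np.allclose(np.linalg.matrix_power(P, 5), P)   # P^t = P for t >= 1

# Vectors already in the column space of P are left unchanged.
y = P @ rng.standard_normal(n)
assert np.allclose(P @ y, y)
```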

The forward process (Equation 2) becomes

$\mathbf{x}_L = P\,\mathbf{x}_l + \sum_{i=l}^{L-1} \bar{P}_i\,\mathcal{F}(\mathbf{x}_i; \mathcal{W}_i),$ (14)

where $\bar{P}_i = I$ when $i = L-1$, and $\bar{P}_i = P$ otherwise. This implies that $P\,\mathbf{x}_l$ is directly sent to $\mathbf{x}_L$ through skip-connections that behave like a single skip-connection.

Similarly, the backward process (Equation 3) becomes

$\frac{\partial \ell}{\partial \mathbf{x}_l} = P^\top\,\frac{\partial \ell}{\partial \mathbf{x}_L} + \sum_{i=l}^{L-1} \bar{P}_i'^{\top} \left(\frac{\partial \mathcal{F}(\mathbf{x}_i; \mathcal{W}_i)}{\partial \mathbf{x}_i}\right)^{\!\top} \frac{\partial \ell}{\partial \mathbf{x}_{i+1}},$ (15)

where $\bar{P}_i' = I$ when $i = l$, and $\bar{P}_i' = P$ otherwise. This again implies that the gradient with respect to $\mathbf{x}_L$ is directly sent to the gradient with respect to $\mathbf{x}_l$ through the skip-connections.

Information maintenance. Different from identity transformations and orthogonal transformations, idempotent transformations maintain vectors lying in the column space of $P$:

$P\,\mathbf{y} = \mathbf{y}, \quad \text{where } \mathbf{y} = P\,\mathbf{z},$ (16)

and $\mathbf{z}$ is an arbitrary vector of the appropriate dimension.

Apparently, when a vector lies in the null space of $P$, i.e., $P\,\mathbf{x}_l = \mathbf{0}$, it seems that the skip-connections do not help to improve information flow. Considering another term on the right-hand side of Equation 14,

$\bar{P}_l\,\mathcal{F}(\mathbf{x}_l; \mathcal{W}_l),$ (17)

we can find $\mathcal{W}_l$, if $\mathcal{F}$ is formed by convolutional layers and ReLU layers, such that

$\bar{P}_l\,\mathcal{F}(\mathbf{x}_l; \mathcal{W}_l) \neq \mathbf{0},$ (18)

which means that there is a path along which the information does not vanish. Vanishing is still possible though rare, e.g., in the case that $\mathcal{F}(\mathbf{x}_l; \mathcal{W}_l)$ lies in the null space of $\bar{P}_l$. A similar analysis holds for gradient maintenance.

Diagonalization. It is known that an idempotent matrix is diagonalizable: $P = B D B^{-1}$, where $D$ is a diagonal matrix whose diagonal entries are $0$ or $1$, and $B$ is invertible. We illustrate this in Figure 1. A network containing $L$ blocks formed with idempotent transformations, written as follows,

$\mathbf{x}_{l+1} = P\,\mathbf{x}_l + \mathcal{F}(\mathbf{x}_l; \mathcal{W}_l),$ (19)

$\mathbf{y} = \mathbf{x}_{L+1},$ (20)

can be transferred to a network with skip-connections formed by linear transformations whose transformation matrix is a diagonal matrix composed of $0$s and $1$s:

$\mathbf{z}_{l+1} = D\,\mathbf{z}_l + B^{-1}\,\mathcal{F}(B\,\mathbf{z}_l; \mathcal{W}_l),$ (21)

$\mathbf{z}_1 = B^{-1}\,\mathbf{x}_1,$ (22)

$\mathbf{y} = B\,\mathbf{z}_{L+1}.$ (23)
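The diagonalization can also be sketched in the other direction: build an idempotent matrix from an arbitrary invertible $B$ and a 0/1 diagonal $D$, then check the claimed properties numerically (shapes and the number of unit eigenvalues are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4

# Build an idempotent matrix from its diagonalization P = B D B^{-1},
# where D has only 0/1 diagonal entries and B is any invertible matrix.
D = np.diag([1.0, 1.0, 0.0, 0.0])
B = rng.standard_normal((n, n))        # invertible with probability 1
P = B @ D @ np.linalg.inv(B)

assert np.allclose(P @ P, P)           # P is idempotent
assert np.linalg.matrix_rank(P) == 2   # rank = number of unit eigenvalues
```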

3.4 Discussions

We can easily show that identity transformations are both idempotent and orthogonal, since $I \cdot I = I$ (idempotent) and $I^\top I = I$ (orthogonal). Here we discuss a bit more on feature reuse, gradient back-propagation, and extensions.

Feature reuse. We have generalized identity transformations to orthogonal transformations and idempotent transformations to eliminate information vanishing and explosion. One point we want to make clearer is that Equations 4, 6 and 14 (for forward propagation) hold for any $l$ and $L$ with $l < L$. In other words, $\mathbf{x}_L$ reuses all the previous features $\mathbf{x}_l, \mathbf{x}_{l+1}, \dots, \mathbf{x}_{L-1}$, rather than only $\mathbf{x}_{L-1}$. We have a similar observation on gradient reuse.

Gradient back-propagation. Considering the network without skip-connections, back-propagating the gradient through regular connections yields the gradient with respect to $\mathbf{x}_l$: $\frac{\partial \ell}{\partial \mathbf{x}_l} = \left(\frac{\partial \mathcal{F}(\mathbf{x}_l; \mathcal{W}_l)}{\partial \mathbf{x}_l}\right)^{\!\top} \frac{\partial \ell}{\partial \mathbf{x}_{l+1}}$. With linear transformations as skip-connections, $\frac{\partial \ell}{\partial \mathbf{x}_l} = \left(P + \frac{\partial \mathcal{F}(\mathbf{x}_l; \mathcal{W}_l)}{\partial \mathbf{x}_l}\right)^{\!\top} \frac{\partial \ell}{\partial \mathbf{x}_{l+1}}$. Gradient vanishing ($\frac{\partial \ell}{\partial \mathbf{x}_l} = \mathbf{0}$) means that $\frac{\partial \ell}{\partial \mathbf{x}_{l+1}}$ lies in the null space of the transposed Jacobian. Adding a proper $P$ to the Jacobian $\frac{\partial \mathcal{F}}{\partial \mathbf{x}_l}$ in some sense shrinks the null space, and reduces the chance of gradient vanishing. As expected, a transformation $P$ with higher rank leads to a lower chance of gradient vanishing. We empirically justify this in Figure 2.

Extension of idempotent transformations. We extend idempotent transformations, $P = B D B^{-1}$, by relaxing the diagonal entries (eigenvalues) in $D$. The relaxed conditions are (i) the absolute values of the diagonal entries are not larger than $1$ and (ii) there is at least one diagonal entry whose absolute value is $1$. Considering the special case in which the absolute values of the eigenvalues are only $0$ or $1$, the vector in the column space of $P$ is maintained up to sign: $P^2\,\mathbf{y} = \mathbf{y}$ with $\mathbf{y} = P\,\mathbf{z}$. A typical example is a periodic matrix, $P^{k+1} = P$, where $k$ is a positive integer.

3.5 Multi-branch networks

Multi-branch networks have been studied in many recent works AbdiN16 ; TargAL16 ; XieGDTH16 ; ZhaoWLTZ16 . We study the application of the two linear transformations to multi-branch structures.

Orthogonal transformations. Theorem 1 shows how orthogonal transformations are transformed to identity transformations, and the construction still holds for multi-branch structures. Figure 1 depicts the regular connection in the block converted from a block with an orthogonal transformation for multiple branches. We can see that the converted regular connection cannot be separated into multiple branches (as shown in Figure 1) because of two extra transformations: a pre-transformation and a post-transformation. The two transformations are essentially convolutions, which exchange information across the branches. Without the two transformations (e.g., in residual networks using identity transformations), there is no interaction across these branches.

Idempotent transformations. We have shown that idempotent transformations can be transformed to diagonalized idempotent transformations. There are two extra transformations in the regular connections: a pre-transformation and a post-transformation (see Figure 1), while the diagonal entries of the diagonal idempotent matrix are $0$ and $1$. Compared to identity transformations, the regular connection contains two extra convolutions, which is similar to orthogonal transformations and results in information exchange across the branches.

4 Experiments

4.1 Datasets

CIFAR. CIFAR-10 and CIFAR-100 Alex2009 are subsets of the 80 million tiny images database TorralbaFF08 . Both datasets contain 32×32 color images, with 50,000 training images and 10,000 testing images. The CIFAR-10 dataset includes 10 classes, each containing 6,000 images, 5,000 for training and 1,000 for testing. The CIFAR-100 dataset includes 100 classes, each containing 600 images, 500 for training and 100 for testing. We follow a standard data augmentation scheme widely used for these datasets HeZRS16 ; HuangLW16a ; HuangSLSW16 ; LeeXGZT15 ; LinCY13 ; XieGDTH16 : we first zero-pad the images with 4 pixels on each side, and then randomly crop them to produce 32×32 images, followed by horizontally mirroring half of the images. We normalize the images using the channel means and standard deviations.

SVHN. The Street View House Numbers (SVHN) dataset is obtained from house numbers in Google Street View images. It contains 73,257 training images, 26,032 testing images and 531,131 additional training images. Following HuangSLSW16 ; LeeXGZT15 ; LinCY13 , we select out 400 samples per class from the training set and 200 samples per class from the additional set, and use the remaining images as the training set without any data augmentation.

4.2 Setup

Networks. The network starts with a convolutional layer, followed by 3 stages, where each stage contains the same number of building blocks and there are two downsampling layers between the stages, and ends with a global pooling layer followed by a fully-connected layer outputting the classification result.

In our experiments, we consider three kinds of regular connections forming building blocks: (a) single branch, (b) four branches, and (c) depthwise convolution (an extreme multi-branch connection, where each branch contains one channel). Each branch consists of batch normalization, convolution, batch normalization, ReLU and convolution (BN-Conv-BN-ReLU-Conv), which is similar to the pre-activation residual connection HeZRS16ECCV . We empirically study two idempotent transformations and two orthogonal transformations and compare them with identity transformations.

Idempotent transformations: The first one is a merge-and-run style ZhaoWLTZ16 ; ZhaoWLTZ17 , denoted by Idempotent-MR,

$P_{\mathrm{MR}} = \frac{1}{M} \begin{bmatrix} I & I & \cdots & I \\ I & I & \cdots & I \\ \vdots & \vdots & \ddots & \vdots \\ I & I & \cdots & I \end{bmatrix},$ (24)

which is a block matrix containing $M \times M$ blocks, with $M$ being the number of branches, and each block is an identity matrix. The second one is obtained by subtracting $P_{\mathrm{MR}}$ from the identity matrix:

$P_{\mathrm{CMR}} = I - P_{\mathrm{MR}},$ (25)

which is named Idempotent-CMR (C = complement). The ranks of the two matrices are $N/M$ and $N - N/M$, where $N$ is the size of the matrix, i.e., the total number of channels.
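The two matrices can be sketched with a Kronecker product ($M = 4$ branches and $8$ channels per branch are assumed purely for illustration), checking idempotency and the stated ranks:

```python
import numpy as np

M, k = 4, 8            # assumed: 4 branches, 8 channels per branch
N = M * k              # total number of channels

# Idempotent-MR: an M x M grid of blocks, each block being I scaled by 1/M.
P_mr = np.kron(np.full((M, M), 1.0 / M), np.eye(k))
# Idempotent-CMR: its complement.
P_cmr = np.eye(N) - P_mr

for P in (P_mr, P_cmr):
    assert np.allclose(P @ P, P)       # both are idempotent

assert np.linalg.matrix_rank(P_mr) == N // M        # rank N/M
assert np.linalg.matrix_rank(P_cmr) == N - N // M   # rank N - N/M
```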

Orthogonal transformations: The first one is built from the Kronecker product, $O = Q \otimes Q \otimes \cdots \otimes Q$, where $\otimes$ is the Kronecker product operation and $Q$ is a small fixed orthogonal seed matrix (Equation 26). We name it Orthogonal-TP. The second one is a random orthogonal transformation, named Orthogonal-Random, also constructed using the Kronecker product; in each block, we generate a different random orthogonal transformation.
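A sketch of the Kronecker-product construction. The exact seed matrix is not reproduced here, so a normalized 2×2 Hadamard-style matrix is assumed purely for illustration; any orthogonal seed works, since the Kronecker product of orthogonal matrices is orthogonal:

```python
import numpy as np

# Assumed seed: a normalized 2x2 Hadamard-style orthogonal matrix.
Q = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)
assert np.allclose(Q.T @ Q, np.eye(2))

# Repeated Kronecker products scale the transform up to the full
# channel dimension while preserving orthogonality.
O = Q
for _ in range(4):                     # 2^5 = 32 channels
    O = np.kron(O, Q)
assert O.shape == (32, 32)
assert np.allclose(O.T @ O, np.eye(32))
```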

Figure 2: Illustrating how the CIFAR-10 classification accuracy changes with the rank $r$ of the idempotent matrix for a fixed-depth network ($r$ is changed when designing $P$). $r = 0$ corresponds to the case in which there is no skip-connection.

Training. We use the SGD algorithm with Nesterov momentum to train all the networks on CIFAR-10/CIFAR-100 and SVHN, with the same total mini-batch size. The initial learning rate is set to 0.1 and is divided by 10 at two fixed fractions of the total number of training epochs. Following residual networks HeZRS15 , the weight decay is 0.0001, the momentum is 0.9, and the weights are initialized as in residual networks HeZRS15 . Our implementation is based on Keras and TensorFlow AbadiABBCCCDDDG16 .

4.3 Results

Single-branch. We compare four skip-connections: identity transformation, Idempotent-CMR, Orthogonal-TP and Orthogonal-Random. To form the idempotent matrix for Idempotent-CMR, we set $M$ to be the number of channels, i.e., $P_{\mathrm{MR}}$ is a matrix with all entries being $1/N$. We do not evaluate Idempotent-MR because in this case its rank is only $1$, and its performance is expected to be low. In general, idempotent transformations with lower ranks perform worse than those with higher ranks, which is empirically verified in Figure 2.

Table 1 shows the results over networks of depth 20 and 56, containing 9 and 27 building blocks, respectively. One can see that the results with idempotent transformations and orthogonal transformations are similar to those with identity transformations, empirically demonstrating that idempotent transformations and orthogonal transformations improve information flow.

Depth Mapping Width Accuracy
CIFAR-10 CIFAR-100 SVHN
20 Identity 16,32,64
Idempotent-CMR 16,32,64
Orthogonal-TP 16,32,64
Orthogonal-Random 16,32,64
56 Identity 16,32,64
Idempotent-CMR 16,32,64
Orthogonal-TP 16,32,64
Orthogonal-Random 16,32,64
Table 1: Comparing classification accuracies of identity transformations, idempotent transformations, and orthogonal transformations for single-branch regular connections. Each entry is the average accuracy over five runs, together with the standard deviation.
Depth Mapping Width Accuracy
CIFAR-10 CIFAR-100 SVHN
20 Identity 32,64,128
Idempotent-CMR 32,64,128
Idempotent-MR 32,64,128
Orthogonal-TP 32,64,128
Orthogonal-Random 32,64,128
32 Identity 32,64,128
Idempotent-CMR 32,64,128
Idempotent-MR 32,64,128
Orthogonal-TP 32,64,128
Orthogonal-Random 32,64,128
56 Identity 32,64,128
Idempotent-CMR 32,64,128
Idempotent-MR 32,64,128
Orthogonal-TP 32,64,128
Orthogonal-Random 32,64,128
110 Identity 32,64,128
Idempotent-CMR 32,64,128
Idempotent-MR 32,64,128
Orthogonal-TP 32,64,128
Orthogonal-Random 32,64,128
Table 2: Comparing classification accuracies for multiple-branch regular connections. The width of each branch is the same and the total width is described in the table.

Four-branch. We compare the results over the networks, where each regular connection consists of four branches. The results are shown in Table 2. One can see that the idempotent and orthogonal transformations perform better than identity transformations. The reason is that compared to identity transformations, the designed idempotent and orthogonal transformations introduce interactions across the four branches.

Depth-wise. We evaluate the performance over extreme multi-branch networks: depth-wise networks, where each branch contains only a single channel. Table 3 shows the results. One can see that the comparison is consistent with the four-branch case.

Depth Mapping Width Accuracy
CIFAR-10 CIFAR-100 SVHN
20 Identity 64,128,256
Idempotent-CMR 64,128,256
Orthogonal-TP 64,128,256
Orthogonal-Random 64,128,256
56 Identity 64,128,256
Idempotent-CMR 64,128,256
Orthogonal-TP 64,128,256
Orthogonal-Random 64,128,256
Table 3: Comparing classification accuracies for depth-wise networks.

5 Conclusions

We introduce two linear transformations, orthogonal and idempotent transformations, which, as we show theoretically and empirically, behave like identity transformations, improving information flow and easing the training. One interesting point is that the success stems from feature and gradient reuse through the expressway composed of skip-connections, which maintains the information during the flow and eliminates the gradient vanishing problem.

References