AutoShuffleNet: Learning Permutation Matrices via an Exact Lipschitz Continuous Penalty in Deep Convolutional Neural Networks

01/24/2019 ∙ by Jiancheng Lyu, et al. ∙ 0

ShuffleNet is a state-of-the-art lightweight convolutional neural network architecture. Its basic operations include group convolution, channel-wise convolution and channel shuffling. However, channel shuffling is designed manually on an empirical basis. Mathematically, shuffling is a multiplication by a permutation matrix. In this paper, we propose to automate channel shuffling by learning permutation matrices in network training. We introduce an exact Lipschitz continuous non-convex penalty so that it can be incorporated in stochastic gradient descent to approximate permutations at high precision. Exact permutations are obtained by simple rounding at the end of training and are used in inference. The resulting network, referred to as AutoShuffleNet, achieved improved classification accuracies on the CIFAR-10 and ImageNet data sets. In addition, we found experimentally that the standard convex relaxation of permutation matrices into stochastic matrices leads to poor performance. We prove theoretically the exactness (error bounds) in recovering permutation matrices when our penalty function is zero (very small). We present examples of permutation optimization through graph matching and two-layer neural network models where the loss functions are calculated in closed analytical form. In the examples, convex relaxation failed to capture permutations whereas our penalty succeeded.




1 Introduction

Light convolutional deep neural networks (LCNN) are attractive in resource-limited conditions for delivering high performance at low cost. Some of the state-of-the-art LCNNs in computer vision are ShuffleNet ([20], [13]), IGC (Interleaved Group Convolutions, [19]) and IGCV3 (Interleaved Low-Rank Group Convolutions, [15]). A noticeable feature in their design is the presence of permutations for channel shuffling in between separable convolutions. However, the permutations are hand-crafted by designers outside of network training. A natural question is whether the permutations can be learned like network weights so that they are optimized based on training data. An immediate difficulty is that, unlike weights, permutations are highly discrete and incompatible with the stochastic gradient descent (SGD) methodology, which is continuous in nature. To overcome this challenge, we introduce an exact Lipschitz continuous non-convex penalty so that it can be incorporated in SGD to approximate permutations at high precision. Consequently, exact permutations are obtained by simple rounding at the end of network training with a negligible drop in classification accuracy. To be specific, we shall work with the ShuffleNet architecture ([20], [13]). Our approach extends readily to other LCNNs.

Related Work. Permutation optimization is a long-standing problem arising in operations research and graph matching, among other applications [7, 3]. Well-known examples are linear and quadratic assignment problems [16]. Graph matching is a special case of the quadratic assignment problem, and can be formulated over the set $\Pi_N$ of $N \times N$ permutation matrices as: $\min_{\pi \in \Pi_N} \|A - \pi B \pi^T\|_F^2$, where $A$ and $B$ are the adjacency matrices of graphs with $N$ vertices, and $\|\cdot\|_F$ is the Frobenius norm. A popular way to handle $\Pi_N$ is to relax it to the Birkhoff polytope $\mathcal{D}_N$, the convex hull of $\Pi_N$, leading to a convex relaxation. The problem may still be non-convex due to the objective function. The explicit realization of $\mathcal{D}_N$ is the set of doubly stochastic matrices $\mathcal{D}_N = \{M \in \mathbb{R}^{N \times N} : M \mathbf{1} = \mathbf{1},\ M^T \mathbf{1} = \mathbf{1},\ M \ge 0\}$, where $\mathbf{1} = (1, \dots, 1)^T \in \mathbb{R}^N$. An approximate yet simpler way to treat $\mathcal{D}_N$ is through a classical first order matrix scaling algorithm, e.g. the Sinkhorn method [14]. Though in principle such an algorithm converges, the cost can be quite high when iterating many times, which causes a bottleneck effect [11]. A non-convex and more compact relaxation of $\Pi_N$ is by a sorting network [11], which maps a box into a manifold that sits inside $\mathcal{D}_N$ and contains $\Pi_N$. Yet another method is the path following algorithm [18], which seeks solutions under concave relaxations of $\Pi_N$ by solving a linear interpolation of convex-concave problems (starting from the convex relaxation). None of the existing relaxations are exact.

Contribution. Our non-convex relaxation is a combination of the matrix $\ell_{1-2}$ penalty function and the set of doubly stochastic matrices $\mathcal{D}_N$. The $\ell_{1-2}$ penalty (the difference of the $\ell_1$ and $\ell_2$ norms) has been proposed and found effective in selecting sparse vectors under nearly degenerate linear constraints [5, 17]. The matrix version is simply a sum of $\ell_{1-2}$ over all row and column vectors. We prove that the penalty is zero when applied to a matrix in $\mathcal{D}_N$ if and only if the matrix is a permutation matrix. Thanks to the $\mathcal{D}_N$ constraint, the penalty function is Lipschitz continuous (almost everywhere differentiable). This allows the penalty to be integrated directly into SGD for learning permutations in LCNNs. As shown in our experiments on the CIFAR-10 and ImageNet data sets, the distance to the set of permutation matrices $\Pi_N$ turns out to be remarkably small at the end of network training, so that a simple rounding has negligible effect on the validation accuracy. We also found that convex relaxation by $\mathcal{D}_N$ fails to capture good permutations for LCNNs. To the best of our knowledge, this is the first time permutations have been successfully learned on deep CNNs to improve on hand-crafted permutations.

Outline. In section 2, we introduce our exact permutation penalty, and prove the closeness to permutation matrices when the penalty values are small, as observed in the experiments. We also present the training algorithm, which combines thresholding and matrix scaling to approximate the projection onto the doubly stochastic matrices for SGD. In section 3, we analyze three permutation optimization problems to show the necessity of our penalty. In a 2-layer neural network regression model with a shortcut (identity map), convex relaxation does not give the optimal permutation even with additional rounding, while our penalty does. In section 4, we show experimental results on the consistent improvement of auto-shuffle over hand-crafted shuffle on both the CIFAR-10 and ImageNet data sets. The conclusion is in section 5.

2 Permutation, Matrix Penalty and Exact Relaxation

The channel shuffle operation in ShuffleNet [20, 13] can be represented as multiplying the feature map in the channel dimension by a permutation matrix $M$. The permutation matrix $M$ is a square binary matrix with exactly one entry of one in each row and each column and zeros elsewhere. In the ShuffleNet architecture [20, 13], $M$ is preset by the designers and will be called “manual”. In this work, we propose to learn an automated permutation matrix $M$ through network training, hence removing the human factor in its selection towards a more optimized shuffle. Since permutation is discrete in nature and too costly to enumerate, we propose to approach it by adding a matrix generalization of the $\ell_{1-2}$ penalty [5, 17] to the network loss function in the stochastic gradient descent (SGD) based training.
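To make the matrix view concrete, the following minimal NumPy sketch builds the permutation matrix of a manual shuffle and checks that multiplying by it reproduces the shuffle. It assumes the usual reshape-transpose-flatten form of the ShuffleNet shuffle; the helper names `manual_shuffle_indices` and `permutation_matrix` and the toy sizes are illustrative choices, not taken from the paper.

```python
import numpy as np


def manual_shuffle_indices(channels: int, groups: int) -> np.ndarray:
    """Source-channel index for each output channel of a group shuffle."""
    assert channels % groups == 0
    # Reshape channels to (groups, channels_per_group), transpose, flatten.
    return np.arange(channels).reshape(groups, channels // groups).T.reshape(-1)


def permutation_matrix(indices: np.ndarray) -> np.ndarray:
    """Row k is the unit vector e_{indices[k]}, so (P @ x)[k] == x[indices[k]]."""
    return np.eye(len(indices))[indices]


channels, groups = 6, 3
idx = manual_shuffle_indices(channels, groups)
P = permutation_matrix(idx)

# Toy feature map: one row per channel, flattened spatial positions as columns.
x = np.arange(channels * 4, dtype=float).reshape(channels, 4)
shuffled_by_indexing = x[idx]
shuffled_by_matrix = P @ x
```

Both routes give the same feature map, which is why a learnable square matrix in place of `P` is a natural relaxation of the shuffle.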

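Specifically for $M = (m_{ij}) \in \mathbb{R}^{N \times N}$, the proposed continuous matrix penalty function is

$$P(M) := \sum_{i=1}^{N} \Big[ \sum_{j=1}^{N} |m_{ij}| - \Big( \sum_{j=1}^{N} m_{ij}^2 \Big)^{1/2} \Big] + \sum_{j=1}^{N} \Big[ \sum_{i=1}^{N} |m_{ij}| - \Big( \sum_{i=1}^{N} m_{ij}^2 \Big)^{1/2} \Big] \quad (1)$$

in conjunction with the doubly stochastic constraint:

$$m_{ij} \ge 0, \ \forall (i,j); \qquad \sum_{i=1}^{N} m_{ij} = 1, \ \forall j; \qquad \sum_{j=1}^{N} m_{ij} = 1, \ \forall i. \quad (2)$$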

Remark 1.

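When the constraints in (2) hold, the $\ell_1$ terms $\sum_{j} |m_{ij}|$ and $\sum_{i} |m_{ij}|$ in $P$ are identically 1 and can be removed (they contribute only a constant). However, in actual computation, the two equality constraints of (2) only hold approximately, so the full expression in (1) is necessary.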

Remark 2.

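Thanks to (2), we see that the penalty function $P$ is actually Lipschitz continuous in $M$, as $\sum_{j} m_{ij}^2 \ge 1/N$, $\forall i$, and $\sum_{i} m_{ij}^2 \ge 1/N$, $\forall j$ (by the Cauchy-Schwarz inequality), so the $\ell_2$ terms stay away from zero.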
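As a quick numerical check of the penalty described above (the sum of the $\ell_1$ minus $\ell_2$ norms over all rows and columns), here is a minimal NumPy sketch. The function name `l1_l2_penalty` and the two test matrices are illustrative assumptions.

```python
import numpy as np


def l1_l2_penalty(M: np.ndarray) -> float:
    """Sum of (l1 norm - l2 norm) over all rows and all columns of M."""
    abs_M = np.abs(M)
    row_terms = abs_M.sum(axis=1) - np.sqrt((M ** 2).sum(axis=1))
    col_terms = abs_M.sum(axis=0) - np.sqrt((M ** 2).sum(axis=0))
    return float(row_terms.sum() + col_terms.sum())


N = 4
perm = np.eye(N)[[2, 0, 3, 1]]          # a permutation matrix
uniform = np.full((N, N), 1.0 / N)      # the most "diffuse" doubly stochastic matrix

penalty_perm = l1_l2_penalty(perm)        # every row/column has l1 = l2 = 1
penalty_uniform = l1_l2_penalty(uniform)  # every row/column has l1 = 1, l2 = 1/2
```

The permutation matrix gives a penalty of zero, while the uniform matrix, which is doubly stochastic but far from any permutation, gives a strictly positive value.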

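Theorem 1.

A square matrix $M$ is a permutation matrix if and only if $P(M) = 0$, and the doubly stochastic constraint (2) holds.

Proof.

($\Rightarrow$) Since a permutation matrix consists of columns (rows) with exactly one entry of 1 and the rest being zeros, each term inside the outer sum of $P$ equals zero, and clearly (2) holds.

($\Leftarrow$) By the Cauchy-Schwarz inequality,

$$\sum_{j=1}^{N} |m_{ij}| - \Big( \sum_{j=1}^{N} m_{ij}^2 \Big)^{1/2} \ge 0, \quad \forall i,$$

with equality (for the non-zero rows guaranteed by (2)) if and only if the row-wise cardinality is 1:

$$|\{ j : m_{ij} \ne 0 \}| = 1, \quad \forall i. \quad (3)$$

This is because the mixed product terms like $|m_{ij}|\,|m_{ij'}|$ ($j \ne j'$) in $(\sum_{j} |m_{ij}|)^2$ must all be zero to match $\sum_{j} m_{ij}^2$. This only happens when equation (3) is true. Likewise,

$$\sum_{i=1}^{N} |m_{ij}| - \Big( \sum_{i=1}^{N} m_{ij}^2 \Big)^{1/2} \ge 0, \quad \forall j,$$

with equality if and only if $|\{ i : m_{ij} \ne 0 \}| = 1$, $\forall j$. In view of (2), $M$ is a permutation matrix. ∎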
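For intuition, the smallest non-trivial case can be worked out by hand. Every $2 \times 2$ doubly stochastic matrix is determined by a single parameter $p \in [0,1]$ (the symbol $p$ is introduced here for illustration), and the row and column sums of absolute values all equal 1:

```latex
M(p) = \begin{pmatrix} p & 1-p \\ 1-p & p \end{pmatrix}, \qquad 0 \le p \le 1,
\qquad
P(M(p)) = 4 \left( 1 - \sqrt{p^2 + (1-p)^2} \right).

% Since p^2 + (1-p)^2 = 1 - 2p(1-p) \le 1, with equality iff p(1-p) = 0:
P(M(p)) = 0 \iff p \in \{0, 1\}
\iff M(p) \in \left\{ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
                      \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \right\},
\qquad
\max_{p \in [0,1]} P(M(p)) = P(M(\tfrac12)) = 4 \left( 1 - \tfrac{1}{\sqrt{2}} \right) \approx 1.17 .
```

So on the $2 \times 2$ Birkhoff polytope the penalty vanishes exactly at the two permutation matrices and is largest at the uniform matrix.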

The non-negative constraint in (2) is maintained throughout SGD by thresholding $m_{ij} \to \max(m_{ij}, 0)$. The normalization conditions in (2) are implemented sequentially once per SGD iteration. Hence they are not strictly enforced. In theory, if the column normalization (divide each column by its sum) and row normalization (divide each row by its sum) are iterated sufficiently many times, the resulting matrices converge to a matrix satisfying (2). This is known as the Sinkhorn process or RAS method [14], which is a first order method to approximately solve the so-called matrix scaling problem (MSP). Simply stated, the MSP for a given non-negative real matrix $A$ is to scale its rows and columns (i.e. multiply each by a non-negative constant) to realize the prescribed row sums and column sums. The approximate MSP is: given a tolerance $\epsilon > 0$, find positive diagonal matrices $X$ and $Y$ such that the row and column sums of $XAY$ match the prescribed ones to within $\epsilon$. For a historical account of MSP and a summary of various algorithms to date, see [2] and Table 1 therein. The RAS method is an alternate minimization procedure with convergence guarantees. Each iteration of the RAS method costs complexity $O(m)$, $m$ being the number of non-zero entries in $A$. If the entries of $A$ are polynomially bounded (which is the case during network training due to the continuous nature of SGD), the RAS method converges in $\tilde{O}(\epsilon^{-2})$ iterations [6], giving total complexity $\tilde{O}(m\,\epsilon^{-2})$, where the tilde hides logarithmic factors in $N$ and $1/\epsilon$. Improvements of complexity bounds via minimizing the log of capacity and higher order methods can be found in [2]. However, for our study the first order method [14] suffices for two reasons. One is that it has low computational cost; the other is that the error in the matrix scaling step can be compensated by network weight adjustment during SGD. In fact, we did not find much benefit in iterating the RAS method more than once in terms of enhancing validation accuracy. This is quite different from solving the MSP as a stand-alone task.
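The alternating column and row normalization can be sketched in a few lines of NumPy. This is a minimal illustration for unit row and column sums only; the function names and the well-conditioned random test matrix (entries in [0.5, 1.5]) are assumptions made so that convergence is fast.

```python
import numpy as np


def ras_scale(A: np.ndarray, num_iters: int) -> np.ndarray:
    """Alternately normalize columns and rows of a positive matrix to unit sums."""
    M = A.astype(float).copy()
    for _ in range(num_iters):
        M = M / M.sum(axis=0, keepdims=True)  # column normalization
        M = M / M.sum(axis=1, keepdims=True)  # row normalization
    return M


def max_constraint_violation(M: np.ndarray) -> float:
    """Largest deviation of any row or column sum from 1."""
    col_err = np.abs(M.sum(axis=0) - 1.0).max()
    row_err = np.abs(M.sum(axis=1) - 1.0).max()
    return float(max(col_err, row_err))


rng = np.random.default_rng(0)
A = rng.uniform(0.5, 1.5, size=(8, 8))

err_one = max_constraint_violation(ras_scale(A, 1))    # one sweep, as used inside SGD
err_many = max_constraint_violation(ras_scale(A, 50))  # many sweeps, stand-alone scaling
```

A single sweep leaves a small residual in the column sums, which matches the remark that the normalization in (2) is only approximately enforced during training.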

The multiplication by $M$ can be embedded in the network as a $1 \times 1$ convolution layer with $M$ initialized as the absolute value of a random Gaussian matrix. After each weight update, we threshold the weights to be non-negative, normalize the columns to unit sums, then repeat on the rows, as in steps (4)-(6) of Algorithm 1. Let $f$ be the network loss function. The training minimizes the objective function:

$$L(w, \mathbf{M}) := f(w, \mathbf{M}) + \lambda \sum_{l=1}^{K} P(M_l), \quad (4)$$

where $K$ is the total number of “channel shuffles”, the $M_l$'s are abbreviated as $\mathbf{M}$, $w$ is the network weight, and $\lambda$ is a positive parameter. The training algorithm is summarized in Algorithm 1. The $\ell_1$ term in the penalty function has a standard sub-gradient, and the $\ell_2$ term is differentiable away from zero, which is guaranteed in Algorithm 1 due to the continuity of SGD and the normalization in columns and rows. The parameter $\lambda$ is chosen so as to balance the contributions of the two terms in (4) and drive the $M_l$'s close to permutation matrices.

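Input:
mini-batch loss function $f_t(w, \mathbf{M})$, $t$ being the iteration index;
learning rate $\eta_t$ for $(w, \mathbf{M})$;
penalty parameter $\lambda$ for $P$;
total iteration number $T$.

Start:
$w$: sample from unit Gaussian distribution;
$\mathbf{M}$: sample from unit Gaussian distribution then take absolute value.

WHILE $t < T$ DO
  (1) Evaluate the mini-batch gradient $(\nabla_w f_t, \nabla_{\mathbf{M}} f_t)$ at $(w^t, \mathbf{M}^t)$;
  (2) $w^{t+1} \leftarrow w^t - \eta_t \nabla_w f_t(w^t, \mathbf{M}^t)$;   gradient update for weights
  (3) $\mathbf{M}^{t+1} \leftarrow \mathbf{M}^t - \eta_t \nabla_{\mathbf{M}} f_t(w^t, \mathbf{M}^t) - \eta_t \lambda \nabla P(\mathbf{M}^t)$;   gradient update for $\mathbf{M}$
  (4) $\mathbf{M}^{t+1} \leftarrow \max(\mathbf{M}^{t+1}, 0)$;   thresholding to enforce non-negativity constraint
  (5) normalize each column of $\mathbf{M}^{t+1}$ by dividing by the sum of entries in the column;
  (6) normalize each row of $\mathbf{M}^{t+1}$ by dividing by the sum of entries in the row.
END WHILE
Output: $w^T$, $\mathbf{M}^T$; project each matrix inside $\mathbf{M}^T$ to the nearest permutation matrix.
Algorithm 1 AutoShuffle Learning.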
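The loop of Algorithm 1 can be sketched on a toy problem. The sketch below is only an illustration under strong assumptions: there are no network weights, the "loss" is a hypothetical quadratic pulling a single matrix toward a hidden target permutation, and the step size, penalty parameter and iteration count are arbitrary choices rather than the paper's settings.

```python
import numpy as np


def penalty(M):
    """l1 - l2 penalty summed over rows and columns."""
    rows = np.abs(M).sum(axis=1) - np.linalg.norm(M, axis=1)
    cols = np.abs(M).sum(axis=0) - np.linalg.norm(M, axis=0)
    return float(rows.sum() + cols.sum())


def penalty_subgradient(M, eps=1e-12):
    """Sub-gradient of the penalty; eps guards against division by zero."""
    row_norm = np.linalg.norm(M, axis=1, keepdims=True) + eps
    col_norm = np.linalg.norm(M, axis=0, keepdims=True) + eps
    # l1 part: sign(M) once for rows and once for columns; l2 part: M / norm.
    return 2.0 * np.sign(M) - M / row_norm - M / col_norm


def normalize(M):
    """One column normalization followed by one row normalization."""
    M = M / M.sum(axis=0, keepdims=True)
    return M / M.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
N, lr, lam, steps = 4, 0.1, 0.1, 1000           # illustrative hyper-parameters
target = np.eye(N)[[2, 0, 3, 1]]                # hidden permutation of the toy loss
M = normalize(np.abs(rng.standard_normal((N, N))))
initial_penalty = penalty(M)

for _ in range(steps):
    grad_loss = M - target                      # gradient of 0.5 * ||M - target||_F^2
    M = M - lr * (grad_loss + lam * penalty_subgradient(M))  # step (3)
    M = np.maximum(M, 0.0)                      # step (4): thresholding
    M = normalize(M)                            # steps (5)-(6)

final_penalty = penalty(M)
rounded = (M > 0.5).astype(float)               # simple rounding at the end
```

On this toy problem the relaxed matrix ends up at the target permutation, so the final rounding changes nothing; a real network replaces `grad_loss` with the mini-batch gradient of the training loss.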

We shall see that the penalty $P$ indeed gets smaller and smaller during training. Here we show a theoretical bound on the distance to the set of permutation matrices when $P$ is small and (2) holds approximately.

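Theorem 2.

Let the dimension $N$ of a non-negative square matrix $M$ be fixed. If $P(M) = O(\epsilon)$, $\epsilon \ll 1$, and the doubly stochastic constraints are satisfied to $O(\epsilon)$, then there exists a permutation matrix $\pi^*$ such that $\|M - \pi^*\| = O(\epsilon)$.

Proof.

It follows from $P(M) = O(\epsilon)$, each term of $P$ being non-negative, that

$$\sum_{j} m_{ij} - \Big( \sum_{j} m_{ij}^2 \Big)^{1/2} = O(\epsilon), \quad \forall i,$$

implying (after multiplying by $\sum_{j} m_{ij} + (\sum_{j} m_{ij}^2)^{1/2}$, which is bounded by (6) below) that:

$$\sum_{j \ne j'} m_{ij}\, m_{ij'} = O(\epsilon), \quad \forall i. \quad (5)$$

On the other hand for $\forall i$:

$$\sum_{j} m_{ij} = 1 + O(\epsilon). \quad (6)$$

Let $j^* = \arg\max_{j} m_{ij}$, at any $i$. It follows from (6) that

$$m_{ij^*} \ge \frac{1 + O(\epsilon)}{N},$$

and from (5) that

$$m_{ij} = O(\epsilon), \quad \forall j \ne j^*.$$

Hence each row of $M$ is close to a unit coordinate vector, with one entry near 1 and the rest near 0. Similarly from $P(M) = O(\epsilon)$, and $\sum_{i} m_{ij} = 1 + O(\epsilon)$, $\forall j$, we deduce that each column of $M$ is close to a unit coordinate vector, with one entry near 1 and the rest near 0. Combining the two pieces of information above, we conclude that $M$ is close to a permutation matrix. ∎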
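The scaling in Theorem 2 can be illustrated numerically on a simple one-parameter family. The sketch below mixes a permutation matrix with the uniform matrix, which keeps the result exactly doubly stochastic; the family, the parameter values and the names are illustrative assumptions, and the check only shows that penalty and distance shrink together at the same linear rate on this family.

```python
import numpy as np


def penalty(M):
    """l1 - l2 penalty summed over rows and columns."""
    rows = np.abs(M).sum(axis=1) - np.linalg.norm(M, axis=1)
    cols = np.abs(M).sum(axis=0) - np.linalg.norm(M, axis=0)
    return float(rows.sum() + cols.sum())


N = 4
perm = np.eye(N)[[1, 3, 0, 2]]

results = []
for eps in (1e-1, 1e-2, 1e-3):
    # Convex combination of a permutation and the uniform matrix (entries 1/N).
    M = (1.0 - eps) * perm + eps / N
    results.append((eps, penalty(M), float(np.linalg.norm(M - perm))))
```

For this family the Frobenius distance to the permutation equals `eps * sqrt(N - 1)`, and the penalty stays below `2 * (N - 1) * eps`, so both are of order `eps`.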

The learned non-negative matrix $M$ will be called a relaxed shuffle and will be rounded to the nearest permutation matrix to produce a final auto shuffle. Strictly speaking, this “rounding” involves finding the orthogonal projection onto the set of permutation matrices, a problem called the linear assignment problem (LAP), see [1] and references therein. The LAP can be formulated as a linear program over the doubly stochastic matrices, i.e. under constraints (2), and is solvable in polynomial time [1]. As we shall see later in Table 3, the relaxed shuffle comes remarkably close to an exact permutation in network learning. Hence it turns out to be unnecessary to solve the LAP exactly; a simple rounding will do. AutoShuffleNet units adapted from ShuffleNet v1 [20] and ShuffleNet v2 [13] are illustrated in Figs. 1-2.
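The difference between simple rounding and the exact projection can be seen on small matrices. The sketch below solves the projection by brute force over all permutations, which is only feasible for tiny sizes and stands in for a proper LAP solver; the threshold of 1/2 for simple rounding and the two test matrices are illustrative assumptions.

```python
import itertools

import numpy as np


def simple_round(M: np.ndarray) -> np.ndarray:
    """Entry-wise rounding: 1 above 1/2, 0 otherwise."""
    return (M > 0.5).astype(float)


def nearest_permutation(M: np.ndarray) -> np.ndarray:
    """Exact Frobenius projection onto permutation matrices by enumeration."""
    N = M.shape[0]
    best_cols, best_score = None, -np.inf
    for cols in itertools.permutations(range(N)):
        # <M, pi> for the permutation sending row i to column cols[i];
        # maximizing it minimizes ||M - pi||_F since ||pi||_F^2 = N is constant.
        score = M[np.arange(N), list(cols)].sum()
        if score > best_score:
            best_cols, best_score = cols, score
    return np.eye(N)[list(best_cols)]


N = 4
perm = np.eye(N)[[3, 1, 0, 2]]
near_perm = 0.97 * perm + 0.03 / N      # relaxed shuffle close to a permutation
diffuse = np.full((N, N), 1.0 / N)      # relaxed shuffle far from any permutation

rounded_near = simple_round(near_perm)
projected_near = nearest_permutation(near_perm)
rounded_diffuse = simple_round(diffuse)
```

When the relaxed matrix is close to a permutation the two procedures agree, whereas for a diffuse matrix simple rounding does not even return a permutation matrix.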

Figure 1: AutoShuffleNet units based on ShuffleNet v1.
Figure 2: AutoShuffleNet units based on ShuffleNet v2.

3 Permutation Problems Unsolvable by Convex Relaxation

The doubly stochastic matrix condition (2) is a popular convex relaxation of permutation. However, it is not powerful enough to enable auto-shuffle learning, as we shall see later. In this section, we present examples from permutation optimization to show the limitation of the convex relaxation (2), and how our proposed penalty (1) can strengthen (2) to retrieve permutation matrices.

Let us recall the graph matching (GM) problem, see [16, 11, 1, 12, 18] and references therein. The goal is to align the vertices of two graphs to minimize the number of edge disagreements. Given a pair of $N$-vertex graphs $G_A$ and $G_B$, with respective adjacency matrices $A$ and $B$, the GM problem is to find a permutation matrix $\pi$ to minimize $\|A - \pi B \pi^T\|_F^2$. Let $\Pi_N$ be the set of all $N \times N$ permutation matrices, solve

$$\min_{\pi \in \Pi_N} \|A - \pi B \pi^T\|_F^2. \quad (7)$$
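For very small graphs the graph matching problem can be solved by enumeration, which makes the objective easy to inspect. The sketch below assumes the Frobenius-norm objective $\|A - \pi B \pi^T\|_F^2$ named in the related work paragraph; the path graph, the relabeling and the function names are illustrative choices.

```python
import itertools

import numpy as np


def gm_objective(A, B, P):
    """Squared Frobenius mismatch between A and the relabeled graph P B P^T."""
    return float(np.linalg.norm(A - P @ B @ P.T, "fro") ** 2)


def brute_force_gm(A, B):
    """Enumerate all permutation matrices; only sensible for very small graphs."""
    N = A.shape[0]
    best_value, best_P = None, None
    for perm in itertools.permutations(range(N)):
        P = np.eye(N)[list(perm)]
        value = gm_objective(A, B, P)
        if best_value is None or value < best_value:
            best_value, best_P = value, P
    return best_value, best_P


# Path graph 0-1-2-3 and a copy with relabeled vertices.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
relabel = np.eye(4)[[2, 0, 3, 1]]
B = relabel.T @ A @ relabel             # so that relabel @ B @ relabel.T == A

best_value, best_P = brute_force_gm(A, B)
# For a permutation matrix P, ||A - P B P^T||_F equals ||A P - P B||_F.
alt_value = float(np.linalg.norm(A @ best_P - best_P @ B, "fro") ** 2)
```

The number of candidates grows as $N!$, which is why relaxations of the permutation set are used in practice.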


By algebraic identity:

the GM problem (7) is same as:


a quadratic assignment problem (QAP). The general QAP for two real square matrices and is [16, 11]:


The convex relaxed GM is:


As an instance of general QAP, let us consider problem (7) in case for two real matrices:

If , then:

and equals ( ):

Example 1: We show that by selecting:

which is convex on and has minimum at . The convex relaxed matrix solution is:

however, the permutation matrix solution to GM problem (7) is the identity matrix at .

In the spirit of objective function (4), let us minimize:

or equivalently minimize (after skipping some additive constants in ):

An illustration of is shown in Fig. 3. The minimal point moves from the interior of the interval when (dashed line, top curve) to the end point 1 as increases to 1 (line-star, middle curve) and remains there as further increases to 2 (line-circle, bottom curve). So for , is recovered with our proposed penalty. ∎

Figure 3: The function as penalty parameter varies from 0.25 (interior minimal point, dashed line, top) to 1 (line-star, middle) and 2 (line-circle, bottom). Minimal point occurs at in the latter two curves.

Example 2: Consider the adjacency matrix of an undirected graph of 2 nodes and 1 edge (with a loop at node 1). An edge adds 1 and a loop adds 2 to an adjacency matrix.


So . The regularized objective function (modulo additive constants) is:

with . In view of:

two possible interior critical points are:


Since , if , the second equality in (11) is ruled out. Comparing with , we see that the global minimal point does not occur at if

Hence if , minimizing recovers . ∎

In Fig. 4, we show that two minimal points of occur in the interior of when , and transition to , at . When , or , the second equality in (11) cannot hold, becomes convex with a unique minimal point at .

Figure 4: The function as penalty parameter varies from 1.8 (solid line, top) to 1.9 (dot, middle) and 2 (line-circle, bottom) where minimal points occur at . Interior minimal points occur on when .
Remark 3.

We refer to [12] on certain correlated random Bernoulli graphs where . On the other hand, there is a class of friendly graphs [1] where . Existing techniques to improve convex relaxation on GM and QAP include approximate quadratic programming, sorting networks and path following based homotopy methods [16, 11, 18]. Our proposed penalty (1)-(2) appears more direct and generic. A detailed comparison will be left for a future study.

Remark 4.

In example 1 above, if the convex relaxed is rounded up to 1, then . In example 2 (Fig. 4), the two interior minimal points at , after rounding down (up), become zero or one. So convex relaxation with the help of rounding still recovers the exact permutation. We show in example 3 below that convex relaxation still fails after rounding (to 1 if the number is above 1/2, to 0 if the number is below 1/2).

Example 3: We consider the two-layer neural network model with one hidden layer [10]. Given , the forward model is the following function:


Here $\sigma$ is the ReLU activation function, $x$ is the input random vector drawn from a probability distribution, $W$ is the weight matrix, and $I$ is the identity matrix. Consider a two-layer teacher network with weight matrix

We train the student network with doubly stochastic constraint on using the loss:

Let ,

We write the loss function as


where for , is defined as

Define then and

For simplicity, let

obey uniform distribution on

. For , , , equals


We have


where . The last term in (12) is a constant:


Consider a special case when , , and . By (14)-(16), the loss function equals

Let When , , Fig. 5 (top left) shows has minimal points in the interior of . A permutation matrix that minimizes can be achieved by rounding the minimal points. However, when , , (Fig. 5, top right), rounding the interior minimal point of gives the wrong permutation matrix at . At , the regularization selects the correct permutation matrix.

Figure 5: ( left, right) as penalty parameter varies for the uniformly distributed input data on .
Remark 5.

If obeys the unit Gaussian distribution as in [10], the functions are more complicated analytically, however their plots resemble those for uniformly distributed , see Fig. 6.

Figure 6: ( left, right) as penalty parameter varies for unit Gaussian input data on .

4 Experiments

We relax the shuffle units in ShuffleNet v1 [20] and ShuffleNet v2 [13] and perform experiments on CIFAR-10 [8] and ImageNet [4, 9] classification datasets. The accuracy results of auto shuffles are evaluated after the relaxed shuffles are rounded.

On CIFAR-10 dataset, we set the penalty parameter . All experiments are randomly initialized with learning rate linearly decaying from . We train each network for epochs. We set weight-decay , momentum and batch size . The experiments are carried out on a machine with single Nvidia GeForce GTX 1080 Ti GPU. In Table 1, we see that auto shuffle consistently improves by as much as 1% on manual shuffle in both v1 and v2 models of ShuffleNet.

Networks               Manual   Auto
ShuffleNet v1 (g=3)    90.55    91.76
ShuffleNet v1 (g=8)    90.06    91.26
ShuffleNet v2 (1)      91.90    92.81
ShuffleNet v2 (1.5)    92.56    93.22
Table 1: CIFAR-10 validation accuracies (%).

On ImageNet dataset, we set the penalty parameter . For each experiment, the training process includes two training cycles: the first cycle is randomly initialized with learning rate starting at and the second one is resumed from the first one with learning rate starting at . Each cycle consists of epochs and the learning rate decays linearly. We set weight-decay , momentum and batch size . The experiments are carried out on a machine with 4 Nvidia GeForce GTX 1080 Ti GPUs. In Table 2, we see that auto shuffle again consistently improves on manual shuffle for both v1 and v2 models.

Networks               Manual   Auto
ShuffleNet v1 (g=3)    65.50    65.62
ShuffleNet v1 (g=8)    65.18    65.76
ShuffleNet v2 (1)      67.46    67.69
ShuffleNet v2 (1.5)    70.58    70.60
Table 2: ImageNet top-1 validation accuracies (%).
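The size of the improvement is easiest to read as a per-network difference. The short sketch below recomputes the gains from the numbers in Tables 1 and 2; the dictionary layout and the helper name `gains` are illustrative.

```python
# (manual, auto) validation accuracies in percent, copied from Tables 1 and 2.
cifar10 = {
    "ShuffleNet v1 (g=3)": (90.55, 91.76),
    "ShuffleNet v1 (g=8)": (90.06, 91.26),
    "ShuffleNet v2 (1)": (91.90, 92.81),
    "ShuffleNet v2 (1.5)": (92.56, 93.22),
}
imagenet = {
    "ShuffleNet v1 (g=3)": (65.50, 65.62),
    "ShuffleNet v1 (g=8)": (65.18, 65.76),
    "ShuffleNet v2 (1)": (67.46, 67.69),
    "ShuffleNet v2 (1.5)": (70.58, 70.60),
}


def gains(table):
    """Accuracy gain of auto shuffle over manual shuffle, in percentage points."""
    return {name: round(auto - manual, 2) for name, (manual, auto) in table.items()}


cifar10_gains = gains(cifar10)
imagenet_gains = gains(imagenet)
```

On CIFAR-10 the gains range from 0.66 to 1.21 percentage points, while on ImageNet they range from 0.02 to 0.58.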

The permutation matrix of the first shuffle unit in ShuffleNet v1 (g=3) is a matrix of size , which can be visualized in Fig. 7 (manual, left) along with an auto shuffle (right). The dots (blanks) denote locations of 1’s (0’s). The auto shuffle looks disordered while the manual shuffle is ordered. However, the inference cost of auto shuffle is comparable to manual shuffle since the shuffle is fixed and stored after training.

Figure 7: Permutation matrices of the first shuffle unit in ShuffleNet v1 (g=3) for manual shuffle (left) and auto shuffle (right). The auto shuffle is trained on the CIFAR-10 dataset. The dots (blanks) indicate locations of 1’s (0’s).

The accuracy drop due to rounding to produce auto shuffle from relaxed shuffle is indicated by relative error in Table 3. On CIFAR-10 dataset, negligible drop is observed for ShuffleNet v1. Interestingly, rounding even gained accuracy for ShuffleNet v1 on ImageNet dataset.

Dataset     Networks               Relative Error
CIFAR-10    ShuffleNet v1 (g=3)    0
CIFAR-10    ShuffleNet v1 (g=8)    0
CIFAR-10    ShuffleNet v2 (1)      -2.15E-4
CIFAR-10    ShuffleNet v2 (1.5)    -1.07E-3
ImageNet    ShuffleNet v1 (g=3)    +6.10E-5
ImageNet    ShuffleNet v1 (g=8)    +3.04E-5
ImageNet    ShuffleNet v2 (1)      0
ImageNet    ShuffleNet v2 (1.5)    -2.83E-5
Table 3: Relative error of accuracy due to rounding the relaxed shuffle. The signs -/+ refer to an accuracy drop/gain after rounding to produce the auto shuffle from the relaxed shuffle.

The penalty of ShuffleNet v1 (g=3) is plotted in Fig. 8. As the penalty decays, the validation accuracy of auto shuffle (after rounding) becomes closer to relaxed shuffle (before rounding), see Fig. 9.

Figure 8: Training loss and penalty of ShuffleNet v1 (g=3) with relaxed shuffle on CIFAR-10.

Figure 9: Validation accuracy of ShuffleNet v1 (g=3) with relaxed shuffle (before rounding) and auto shuffle (after rounding) on CIFAR-10. The rounding error becomes smaller during training.

To demonstrate the significance of the regularization, we also tested auto shuffle with various values of $\lambda$ on ShuffleNet v1 (g=3). Table 4 shows that for small $\lambda$ the accuracy drops sharply after the relaxed shuffle is rounded. We plot the stochastic matrix of the first shuffle unit of the network at two values of $\lambda$ in Fig. 10. The penalty is large when $\lambda$ is relatively small, indicating that the stochastic matrices learned are not close to optimal permutation matrices.

λ          0         1E-5      1E-4      5E-4      1E-3
relaxed    90.00     90.18     90.48     91.45     91.76
auto       10.00     38.18     11.37     71.50     91.76
penalty    3.37E3    1.59E3    4.95E2    3.13E-1   5.07E-2
Table 4: CIFAR-10 validation accuracies (%) of ShuffleNet v1 (g=3) with relaxed shuffle (before rounding) and auto shuffle (after rounding), and penalty values of the relaxed shuffle at various $\lambda$’s. The penalty and rounding error tend to zero as $\lambda$ increases.
Figure 10: Stochastic matrices of the first shuffle unit in ShuffleNet v1 (g=3) with relaxed shuffle before rounding at two values of $\lambda$ (left and right). The relaxed shuffle is trained on the CIFAR-10 dataset. The matrices are quite diffusive and not close to optimal permutation matrices when $\lambda$ is relatively small.

5 Conclusion

We introduced a novel, exact and Lipschitz continuous relaxation of permutation matrices for learning channel shuffling in ShuffleNet. We showed through a regression problem of a 2-layer neural network with a shortcut that convex relaxation fails even with additional rounding, while our relaxation is precise. We plan to extend our work to auto-shuffling of other LCNNs and hard permutation problems in the future.

6 Acknowledgement

The work was partially supported by NSF grant IIS-1632935.